Getting Started with Bayesian Statistics

using Stan and Python

Author

Bob Carpenter

Published

June 2, 2023

Preface

Welcome to this introduction to Bayesian statistics using Stan in Python. The preface explains what we expect you to know before starting, how to install Stan, and provides the Python boilerplate we will use throughout.

Prerequisites

We will assume the reader will be able to follow text that includes basic notions from

  • differential and integral calculus in multiple dimensions,
  • matrix arithmetic (but not linear algebra),
  • probability theory, including probability density and mass functions, cumulative distribution functions, expectations, events, and the basic rules of probability theory, and
  • Python numerical programming with NumPy.

By basics, we really do mean basics. You won’t need to do any calculus, we will just use it to express what Stan computes. Similarly, we will use matrix notation to express models, and we will avoid advanced topics in linear algebra that are at the heart of some of Stan’s internals.

We include two appendices as both background and summary of notation, with rigorous definitions of the mathematics used in this introduction. Those who are more mathematically inclined may wish to start with the appendices.

Source code and license

All of the source markdown, YAML, LaTeX, and BibTeX files are available in

Everything is open source, with licenses:

  • Code: BSD-3-Clause license
  • Text: CC-BY 4.0

Python, CmdStanPy, NumPy, pandas, and plotnine

For scripting language, we use Python 3.

To access Stan, we use the Python package CmdStanPy.

For numerical and statistical computation in Python, we use NumPy.

For plotting, we use the Python package plotnine. plotnine is a Python reimplementation of ggplot2, which is itself an implementation of the grammar of graphics (Wilkinson 2005).

We use pandas for representing wide-form data frames, primarily because it is the required input for plotnine.

Python boilerplate

We include the following Python boilerplate to import and configure packages we will use throughout this tutorial.

# PROJECT SETUP
# set DRAFT = False for final output; DRAFT = True is faster
DRAFT = False

import itertools
import logging
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings( "ignore", module = "plotnine\..*" )

import cmdstanpy as csp
csp.utils.get_logger().setLevel(logging.ERROR)

import numpy as np
import statistics as stat
import pandas as pd
import plotnine as pn
import patchworklib as pw

def mydraw(x):
    """draw plot in the quarto doc"""
    x.draw()

Acknowledgements

I’d like to thank Mitzi Morris, Barry Smith, and Brian Ward for helpful feedback on earlier drafts. Thanks to ChatGPT Plus 4.0 (May 5 and May 12, 2023 versions) for extensive help on plotting and data frame construction and rewriting some passages I couldn’t manage to write clearly. I’d also like to thank the entire Stan Development Team and all of our users. Stan is an open project and a large team effort and we’re always happy to help onboard new developers and new users.

1 Introduction

These notes are intended to introduce several technical topics to practitioners: Bayesian statistics and probabilistic modeling, Markov chain Monte Carlo methods for Bayesian inference, and the Stan probabilistic programming language.

1.1 Bayesian statistics

The general problem addressed by statistical inference is that of reasoning from a limited number of noisy observations. For example, we might want to perform inference about a population after measuring a subset of its members, or we might want to predict future events after observing past events.

There are many approaches to applied statistics. These notes focus on Bayesian statistics, a form of statistical modeling and inference that is grounded in probability theory. In the Bayesian approach to statistics, we characterize our knowledge of the world in terms of probabilities (e.g., there is a 24.3% chance of rain after lunch today, the probability that the next baby born in the United states is male is 51%).

Bayesian inference is always carried out with respect to a mathematical model of a stochastic data generating process. If the model is well-specified in the sense of matching the true data generating process, then Bayesian statistical inference can be shown to have several desirable properties, such as calibration and resistance to overfitting.

Appendix B provides a short, but precise introduction to Bayesian inference, following Appendix A, which reviews the basics of probability theory. This establishes both the notation and provides more formal definitions of exactly what Stan is computing.

If you’re looking for a gentle introduction to Bayesian statistics, I highly recommend Statistical Rethinking (McElreath 2023). For a more advanced introduction, try Bayesian Data Analysis (Gelman et al. 2013), which is available from the authors as a free pdf.

1.2 Markov chain Monte Carlo methods

Bayesian inference for parameter estimation, prediction, or event probability estimation is based on posterior expectations. A posterior expectation is a high dimensional integral over the space of parameters. Stan adopts the standard approach to solving general high-dimensional integrals, which is the Monte Carlo method. Monte Carlo methods use random sampling (hence the name) to solve high-dimensional integrals (which is not itself a random quantity).

We cannot use standard Monte Carlo methods for most problems of interest in Bayesian statistics because we cannot generate a sample of independent draws from the posterior density of interest.1 The exception is simple models in the exponential family with conjugate priors (Diaconis and Ylvisaker 1979). So instead, we have to resort to Markov chain Monte Carlo (MCMC) methods (Brooks et al. 2011), which create samples with correlation structure among the draws making up the sample.

Alternatives to MCMC include rejection sampling (Gilks and Wild 1992), sequential Monte Carlo (Doucet, De Freitas, and Gordon 2001), approximate Bayesian computation (ABC) (Marin et al. 2012), variational inference (Blei, Kucukelbir, and McAuliffe 2017), and nested Laplace approximation (Rue, Martino, and Chopin 2009), among others.

Among MCMC methods, Stan adopts Hamiltonian Monte Carlo (HMC) (Neal 2011), which is currently the most efficient and scalable MCMC method for smooth target densities. Popular alternatives include random-walk Metropolis-Hastings (Chib and Greenberg 1995) and Gibbs sampling (Casella and George 1992), both of which are simpler, but much less efficient than HMC for all but the easiest of problems.

1.3 Stan and probabilistic programming

Stan is what is known as a domain specific language (DSL), meaning it was written for particular application. Specifically, Stan is a probabilistic programming language (PPL) designed for coding statistical models. Although Stan is used most widely for Bayesian inference, it can also perform standard frequentist inference (e.g., maximum likelihood, bootstrap, etc.), though we do not touch on those capabilities in this introduction.

A Stan program declares data variables and parameters, along with a differentiable posterior log density (of the parameters given the data) up to a constant. Although this is the only requirement, most Stan programs define the joint log density of the parameters and variables, which can be shown to be equal to the log posterior plus a constant.

Stan programs are probabilistic programs in the sense that their data and parameters can represent random variables. One way to think of a random variable in the context of Stan is as a variable that takes its value randomly (though in fact, it takes its value pseudorandomly based on a random seed). This means that by default, running a Stan program twice will give different answers. Stan programs can be made deterministic in order to provide reproducible results by fixing a random seed.

Stan programs are translated to C++ and we estimate quantities of interest using either MCMC or approximate methods like variational inference or Laplace approximation. Users provide Stan data and access Stan output through analytics languages like Python, R, or Julia.

2 Pragmatic Bayesian statistics

There have been several schools of Bayesian statisticians, and Lin (2022) provides an excellent overview with primary references and Little (2006) provides a more in-depth summary comparing to frequentist philosophy. The two most prominent schools are the subjective Bayesians and the objective Bayesians. As suggested by the names, these two paradigms have diametrically opposed philosophical approaches. While both use proper priors in the sense of being probability distributions, the “subjective” approach uses as much prior “belief,” whereas the “objective” approach uses as little prior information as is possible. Both these groups then trust their inferences based on their chosen philosophical approach.

We are going to follow a more pragmatic approach to Bayesian statistics that views model building as more of an engineering discipline than a philosophical exercise. This perspective is laid out in detail in Gelman et al. (2013) and refined in Gelman et al. (2020). The pragmatic approach feels more like modern machine learning than statistics, with its emphasis on predictive calibration (which is itself a frequentist notion when applied to future data) Gneiting, Balabdaoui, and Raftery (2007) and its willingness to modify a statistical model based on its behavior on the data (Gelman et al. 2013).

The fundamental distinguishing feature of frequentist statistics is that probabilities are understood as long-run frequencies of repeatable processes. This prohibits placing probability distributions over parameters, because there is no long-term repeatable process in the world generating new parameters. For example, the gravitational constant has a single value and is not the value of a potentially repeatable trial like a coin flip (other than in a philosophical, possible worlds sense).

In our pragmatic approach to Bayesian statistics, we treat probability as fundamentally epistemic rather than deontic, meaning it is about human knowledge, not about human belief. This is a subtle, but important distinction. Although frequentists sometimes worry that Bayesians are “putting their thumb on the scale” by including their prior knowledge in a model rather than “letting the data speak for itself,” this is an instance of the pot calling the kettle black. The biggest “subjective” decision in model building is shared between Bayesian and frequentist approaches, namely the likelihood assumed to model the data-generating process. In practice, we often sidestep the concerns of subjectivity by using weakly informative priors that indicate the scale, but not the particular value, of a prior. And we furthermore run sensitivity analyses to test the effect of our prior assumptions.

Pierre-Simon Laplace (1814) begins his book on probability by stating the general epistemic position on probability in terms of an entity that knows all (aka Laplace’s demon).

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.

John Stuart Mill (1882) is more explicit in laying out the epistemic view of probability as follows.

We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. \(\ldots\) Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

3 Stan examples: forward simulation and Monte Carlo

By forward simulation, we mean running a simulation of a scientific process forward from the parameter values to simulated data. For example, consider a simple clinical trial with \(N\) subjects and a probability \(\theta \in (0, 1)\) of a positive outcome. Given \(\theta\) and \(N\), we can simulate the number of patients \(y \in 0{:}N\) with a successful outcome according to a binomial distribution (which we define below).

The inverse problem is that of estimating the probability of success \(\theta,\) given an observation of \(y\) successes out of \(N\) subjects. For example, we might have \(N = 100\) subjects in the trial, \(y = 32\) of whom had a positive outcome from the trial. A simple estimate of \(\theta\) in this case would be 0.32. We return to estimation and uncertainty quantification in later sections.

Let’s say we have \(N = 100\) subjects in our clinical trial and the success rate is \(\theta = 0.3\). We can simulate a result \(y\) from the clinical trial by randomly generating the number of subjects with a successful outcome. Although this could be done by simulating the binary outcome for each patient, it wouldn’t be an efficient way to sample from a binomial distribution.

In statistical sampling notation, we write \[ Y \sim \textrm{binomial}(N, \theta) \] to indicate that there are \(N \in \mathbb{N}\) patients with probability \(\theta \in (0, 1)\) of a successful outcome, with \(Y \in 0{:}N\) representing the number of successful outcomes out of \(N\) patients.

The probability mass function function for \(Y\), written \(p_Y\), is defined for \(N \in \mathbb{N}\), \(\theta \in (0, 1)\), and \(y \in 0{:}N\) by \[\begin{align} p_Y(y \mid N, \theta) &= \textrm{binomial}(y \mid N, \theta) \\[6pt] &= \binom{N}{y} \cdot \theta^y \cdot (1 - \theta)^{N - y}. \end{align}\] Unless necessary for disambiguation, we will drop the random variable subscripts on probability density or mass functions like \(p_Y\) going forward, writing simply \(p(y \mid N, \theta)\) and allowing context to disambiguate.

3.1 A first Stan program

Let’s say we wanted to generate random instantiations of \(Y\) for given values of \(N\) and \(\theta\). We can do that using the following Stan program, which we will unpack line by line after its listing.

stan/binomial-rng.stan
data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;
}
generated quantities {
  int<lower=0, upper=N> y = binomial_rng(N, theta);
}

The first thing to notice is that a Stan program is organized into blocks. Here we have two blocks, a data block containing declarations of variables that must be input as data, and a generated quantities block, which not only declares variables, but assigns a value to them. In the case of this Stan program, the generated quantity variable y is assigned the result of taking a single draw from a \(\textrm{binomial}(N, \theta)\) distribution, which Stan provides through the binomial_rng function.

The second thing to notice about a Stan program is that the variables are all declared with types. Stan uses static typing, which means that unlike Python or R, a variable’s type is declared in the program before it is used rather than determined at run time based on what is assigned to it. Once declared, a variable’s type never changes. Stan also uses strong typing, meaning that unlike C or C++, there is no way to get around the type restrictions and access memory directly.

The program declares three variables, N and y of type int (integer values in \(\mathbb{Z}\)), and theta of type real (real values in \(\mathbb{R}\)). On actual computers, our integers will have fixed upper and lower bounds and our real numbers are subject to all the vagaries of numerical floating point calculations. Stan uses double-precision (64-bit) floating point and follows the IEEE 754 standard other than in a few highly-optimized calculations that lose a few bits of precision.

A type may also have constraints. Because N is a count, it must be greater than or equal to zero, which we indicate with the bound lower=0. Similarly, the variable y is the number of successes out of N trials, so it must take on a value between 0 and N (inclusive); that is represented with the constraint lower=0, upper=N. Finally, the variable theta is real and declared to fall in the interval \([0, 1]\) with the constraint lower=0, upper=1. Technically, our bounds are open for real values, but in practice, we might wind up with 0 or 1 values due to underflow or rounding errors in floating point arithmetic.

At run time, the compiled Stan program must be given values for N and theta, at which point, each iteration it will sample a value of y using its built-in pseudorandom number generator. In code, we first define a dictionary for our data (variables N and theta), then construct an instance of CmdStanModel for our model from the path to its program, and finally sample from the model using the sample method of CmdStanModel.

N = 100
theta = 0.3
data = {'N': N, 'theta': theta}
model = csp.CmdStanModel(stan_file = '../stan/binomial-rng.stan')
sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_sampling = 10, iter_warmup = 0,
                      show_progress = False, show_console = False)

The Python boilerplate above set the log level to WARNING for the cmdstanpy package in order to get rid of the information-level messages that would otherwise provide updates on a running Stan program. Changing the log level to INFO provides more information as a Stan program runs, and level DEBUG provides even more.

The constructor for CmdStanModel is used to construct a model from a Stan program found in the specified file. We highly recommend using a standalone file for Stan programs to make them easy to share, to allow both quotes for printing and apostrophes for transposition, and to make it easy to find the lines referenced by number in error messages. Under the hood, this first runs Stan’s transpiler to convert the Stan program to a C++ class. Then it compiles the C++ program, which will take on the order of twenty seconds due to the heavy use of optimization and template metaprograms.

For the sample() method of the CmdStanModel object model, we provide arguments

  • data: the data read in the data block of the Stan program,
  • seed: pseudorandom number generator for reproducibility,
  • chains: the number of simulation runs (parallel_chains indicates how many to run in parallel),
  • iter_sampling: number of draws (i.e., sample size) to return,
  • iter_warmup: number of warmup iterations to tune parameters of the sampling algorithm (not needed here, so set to 0),
  • show_progress: if True, print progress updates, and
  • show_console: pop up a GUI progress monitor.

The sample() of the compiled model class returns a sample consisting of the specified number of random draws (10 here). The way this works here is that when sample() is called, CmdStan runs Stan as a standalone C++ program in a background process. This program starts by copying he data to a file, then reading the data file in to construct a C++ object representing the model. Then for this program, with only a generated quantities block, for each draw (the number of which is specified by iter_sampling) Stan runs a pseudorandom number generator to generate a draw from the specified binomial distribution. Random number generation is done with respect to the seed value specified in the call. For more details on how pseudorandom number generation is performed; see the (free online) book by Devroye (1986) for more information. We describe the operational semantics of Stan in more detail in the section on Stan’s execution model below.

Once sampling has executed, we can extract the sample for the scalar variable y as an array and then print them along with our other variables.

y = sample.stan_variable('y')
print("N = ", N, ";  theta = ", theta, ";  y(0:10) =", *y.astype(int))
N =  100 ;  theta =  0.3 ;  y(0:10) = 33 36 30 28 37 26 25 19 29 33

Let’s put that in a loop and see what it looks like for for N equal to 10, 100, 1000, and 10,000.

for N in [10, 100, 1_000, 10_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False,
              show_console = False)
    y = sample.stan_variable('y')
    print("N =", N)
    print("  y: ", *y.astype(int))
    print("  est. theta: ", *(y / N))
N = 10
  y:  4 4 3 4 3 3 3 5 2 4
  est. theta:  0.4 0.4 0.3 0.4 0.3 0.3 0.3 0.5 0.2 0.4
N = 100
  y:  33 36 30 28 37 26 25 19 29 33
  est. theta:  0.33 0.36 0.3 0.28 0.37 0.26 0.25 0.19 0.29 0.33
N = 1000
  y:  322 324 306 333 311 318 294 323 282 311
  est. theta:  0.322 0.324 0.306 0.333 0.311 0.318 0.294 0.323 0.282 0.311
N = 10000
  y:  3049 3052 3012 3025 3042 3087 3051 2922 2943 3025
  est. theta:  0.3049 0.3052 0.3012 0.3025 0.3042 0.3087 0.3051 0.2922 0.2943 0.3025

On the first line for \(N = 10\) trials, our simple frequency-based estimates range from 0.2 to 0.5. By the time we have 10,000 trials, the frequency-based estimates only vary between 0.292 and 0.309. We know from the central limit theorem that the spread of estimates is expected to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\) for \(N\) draws (this result is only asymptotic in \(N\), but is very close for large-ish \(N\) in practice).

It is hard to scan these results. Let’s repeat simulate 100,000 \(y\) values for each value of \(N\) and plot histograms. The following histogram plots the distribution of estimates based on 10, 100, and 1000 observations over 100,000 simulated trials.

np.random.seed(123)
ts = []
ps = []
theta = 0.3
M = 100 if DRAFT else 100_000
for N in [10, 100, 1_000]:
    data = {'N': N, 'theta': theta}
    sample = model.sample(data = data, seed = 123, chains = 1,
       iter_sampling = M, iter_warmup = 0, show_progress = False,
       show_console = False)
    y = sample.stan_variable('y')
    theta_hat = y / N
    ps.extend(theta_hat)
    ts.extend(itertools.repeat(N, M))
xlabel = 'estimated Pr[success]'    
df = pd.DataFrame({xlabel: ps, 'trials': ts})
mydraw(pn.ggplot(df, pn.aes(x = xlabel))
  + pn.geom_histogram(binwidth=0.01)
  + pn.facet_grid('. ~ trials')
  + pn.scales.scale_x_continuous(limits = [0, 1], breaks = [0, 1/4, 1/2, 3/4, 1],
      labels = ["0", "1/4", "1/2", "3/4", "1"], expand=[0, 0])
  + pn.scales.scale_y_continuous(expand=[0, 0, 0.05, 0])
  + pn.theme(aspect_ratio = 1, panel_spacing = 0.15,
             strip_text = pn.element_text(size = 6),
             strip_background = pn.element_rect(height=0.08,
                                fill = "lightgray")))

Although the histograms have different heights and the first one is spiky, the key consideration here is that they all have the same area, representing the 100,000 simulated values of \(y\). The trial size of 10 only has 10 possible values, 0.0, 0.1, …, 1.0, so the histogram (technically a bar chart here) just shows the counts of those outcomes. Here, \(y = 3\) is the most prevalent result, with corresponding estimate for \(\theta\) of \(y / 10 = 0.3\). The trial size of 100 looks roughly normal, as it should as a binomial with trials \(N = 100\). By the time we get to \(N = 1,000\) trials, the draws for \(y\) concentrate near 300, or near the value of \(0.3\) for \(\theta\). As \(N\) grows, the central limit theorem tells us to expect that the width of these histograms to shrink at a rate of \(\mathcal{O}(1 / \sqrt{N})\).

3.2 Pseudorandom numbers

As a probabilistic programming language, Stan relies on random number generation. Because Stan runs on traditional digital computers (i.e., von Neumann machines), it cannot truly generate random numbers. Instead, it does the next best thing and uses a pseudorandom number generator (PRNG) to generate a sequence of numbers deterministically. Specifically, we set a random number generation seed, which when combined with a PRNG, generates a sequence of numbers deterministically that have many of the properties of truly random numbers. The (free online) book by Devroye (1986) is the definitive reference for pseudorandom number generation for statistical distributions and contains a general introduction to PRNGs.

We can see how random number generators work in Stan by running our sampling method with seeds 123, 19876, and 123.

for seed in [123, 19876, 123]:
    sample = model.sample(data = data, seed = seed, chains = 1,
                          iter_sampling = 10, iter_warmup = 0,
                          show_progress = False, show_console = False)
    print(f"{seed = };  sample = {np.asarray(sample.stan_variable('y'), dtype=int)}")             
seed = 123;  sample = [322 324 306 333 311 318 294 323 282 311]
seed = 19876;  sample = [286 303 286 309 289 276 317 298 328 294]
seed = 123;  sample = [322 324 306 333 311 318 294 323 282 311]

The two runs with seed 123 produce the same results. The code to extract values of y is clunky because the stan_variable() method always returns a floating point type and we have converted it to an integer-valued NumPy array.

Generally, we want our Stan programs to replicate similar results with different random seeds. Substantially different results from different seeds is a red flag that there is something wrong with the combination of model and data.

3.3 Monte Carlo integration

Bayesian computation relies on averaging over our uncertainty in estimating parameters. In general, it involves computing expectations, which are weighted averages with weights given by densities. In this section, we will introduce Monte Carlo methods for calculating a simple integral corresponding to the expectation of a discrete indicator variable. We’ll use the textbook example of throwing darts at a board randomly and using the random locations to estimate the mathematical constant \(\pi\).

We start with a two-unit square centered at the origin. Then we will generate points uniformly at random in this square. For each point \((x, y)\), we will calculate whether it falls inside the unit circle circumscribed within the square, which is true if the distance to the origin is less than 1, \[ \sqrt{x^2 + y^2} < 1, \] which simplifies by squaring both sides to \[ x^2 + y^2 < 1. \] The proportion of such points gives the proportion of the square’s volume taken up by the circle. Because the square is \(2 \times 2\), it has an area of 4, so the circle has an area of 4 times the proportion of points falling inside the circle (i.e., in the open unit disc).

Here’s the Stan code.

stan/monte-carlo-pi.stan
generated quantities {
  real<lower=-1, upper=1> x = uniform_rng(-1, 1);
  real<lower=-1, upper=1> y = uniform_rng(-1, 1);
  int<lower=0, upper=1> inside = x^2 + y^2 < 1;
  real<lower=0, upper=4> pi = 4 * inside;
}

The program declares variables x and y and constrains them to fall in the interval \((-1, 1)\) (numerical overflow may produce values -1 and 1) and assigns them uniform random values. The indicator variable inside is set to 1 if the Euclidean length of the vector \(\begin{bmatrix}x & y\end{bmatrix}\) is less than 1 (i.e., it falls within an inscribed unit circle) and is set to 0 otherwise.
The variable pi is then set to four times the indicator value. As we see below, it is the sample mean of these values that is of interest.

First, we compile and then sample from the model, taking a sample size of M = 10_000 draws. Then we plot the draws.

M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/monte-carlo-pi.stan')
sample = model.sample(chains = 1, iter_warmup = 0, iter_sampling = M,
                      show_progress = False, show_console = False,
                      seed = 123)
x_draws = sample.stan_variable('x')
y_draws = sample.stan_variable('y')
inside_draws = [int(i) for i in sample.stan_variable('inside')]
pi_draws = sample.stan_variable('pi')
inside_named_draws = np.array(["out", "in"])[inside_draws]
df = pd.DataFrame({'x': x_draws, 'y': y_draws,
                   'inside': inside_named_draws})
mydraw(
  pn.ggplot(df, pn.aes(x = 'x', y = 'y',
                group='inside', color='inside'))
  + pn.geom_point(size = 0.1)
  + pn.labs(x = 'x', y = 'y')
  + pn.coord_fixed(ratio = 1)
)

Next, we take the sample mean of the inside-the-circle indicator, which produces an estimate of the probability of a point being inside the circle. This corresponds directly to the expectation \[\begin{align} \mathbb{E}[4 \cdot \textrm{I}(\sqrt{X^2 + Y^2} \leq 1)] &= \int_{-1}^1 \int_{-1}^1 \textrm{I}(x^2 + y^2 <1) \cdot p(x, y) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 \textrm{I}(x^2 + y^2 < 1) \cdot \textrm{uniform}(x \mid -2, 2) \cdot \textrm{uniform}(y \mid -2, 2) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \int_{-1}^1 \int_{-1}^1 4 \cdot \textrm{I}(x^2 + y^2 < 1) \, \textrm{d}x \, \textrm{d}y \\[4pt] &= \pi, \end{align}\] where \(\textrm{I}()\) is the indicator, which returns 1 if its argument is true and 0 otherwise. The posterior mean of the variable inside is the probability that a random point in the 2-unit square is inside the inscribed unit circle. The posterior mean for pi is thus our estimate for \(\pi\).

Pr_is_inside = np.mean(inside_draws)
pi_hat = np.mean(pi_draws)
print(f"Pr[Y is inside circle] = {Pr_is_inside:.3f};")
print(f"estimate for pi = {pi_hat:.3f}")
Pr[Y is inside circle] = 0.786;
estimate for pi = 3.144

The true value of \(\pi\) to 3 digits of accuracy is \(3.142\), so we are close, but not exact, as is the nature of Monte Carlo methods. If we increase the number of draws, our error will go down. Theoretically, with enough draws, we can get any desired precision; in practice, we don’t have that long to wait and have to make do with only a few digits of accuracy in our Monte Carlo estimates. This is usually not a problem because statistical uncertainty still dominates our numerical imprecision in most applications; we discuss this important point later when contrasting estimation uncertainty and sampling uncertainty in the section on posterior predictive inference and when considering practical guidance on how long to run Stan.

Random points are far away in high dimensions

Suppose we wanted to sample points in the unit disc? One thing we could do is sample points in the unit square until we draw one that is in the unit disc. In two dimensions, this is fairly efficient, with 79% of the points falling in the circle. But what will happen in higher dimensions? Let’s write some Stan code and see.

stan/unit-hypersphere.stan
data {
  int<lower=0> D;
}
transformed data {
  array[D] real one_D = rep_array(1, D);
}
generated quantities {
  array[D] real y = uniform_rng(-1, one_D));
  int<lower=0, upper=1> inside = norm2(y) < 1;
}

In this case, we take a natural number D as input for the dimensionality of the hypercube. We have introduced a block for transformed data, and declared a variable one_D to be a size D array of 1 values. The transformed data block is executed once as data is read in and its values are constant outside of the transformed data block. In the generated quantities block, we use the array of 1 values to assign y to an array of values, each of which is independently generated from a \(\textrm{uniform}(-1, 1)\) distribution. This means y will be uniformly distributed within the hypercube \([-1, 1]^D\). The integer inside will be set to 1 if y falls within the unit hypersphere centered at the origin (i.e., circumscribed within the unit hypercube).

By construction, the distance from the origin to the side of the hypercube and hence the radius of the inscribed hypersphere remain constant at 1. In contrast, the distance from the origin to a corner is \(\sqrt{D} = \sqrt{\underbrace{1 + 1 + \cdots 1}_{d \ \textrm{times}}\) in \(D\) dimensions. Now let’s see what happens to the proportion of volume in the inscribed hypersphere as the dimensionality grows.

M = 100 if DRAFT else 10_000
model = csp.CmdStanModel(stan_file = '../stan/unit-hypersphere.stan')
in_probs = np.repeat(1.0, 13)
for D in range(1, 13):
    sample = model.sample(chains=1, iter_warmup = 0, iter_sampling = M,
                          data = {'D' : D}, show_progress = False,
              show_console = False, seed = 123)
    inside_draws = sample.stan_variable('inside')
    in_probs[D] = np.sum(inside_draws) / M

mydraw(pn.ggplot(pd.DataFrame({'D':np.arange(1, 13), 'prob in hypersphere':in_probs[1:]}),
                 pn.aes(x = 'D', y='prob in hypersphere'))
       + pn.geom_line()
       + pn.geom_point(size=1)
       + pn.scale_x_continuous(breaks = [0, 2, 4, 6, 8, 10, 12]))

3.4 Markov chain Monte Carlo methods

In the previous sections, we generated a sample of draws by taking a sequence of independent draws. We then just averaged results to get plug-in estimates for expectations.

In modern applied Bayesian modeling, it is almost never possible to generate independent draws from the posterior distribution and apply pure Monte Carlo methods. The only case where it is possible is within the highly restrictive exponential family of sampling distributions with conjugate priors (Diaconis and Ylvisaker 1979). Before the MCMC revolution of the 1990s, Bayesian inference was largely restricted to conjugate priors. Even if the entire model wasn’t purely conjugate, conjugate priors remained popular due to the use of Gibbs sampling for MCMC, either hand-built or coded in in the first probabilistic programming language, BUGS (Lunn et al. 2012).

The introduction of automatic differentiation opened up the possibility of coding the more efficient and scalable Hamiltonian Monte Carlo method in Stan. This has led to the adoption of more non-conjugate priors and general priors.

In Markov chain Monte Carlo methods, we base each draw on the previous draw. A sequence of random variables each of which depends only on the previous variable generated is called a Markov chain. That is, a sequence of random variables \(Y_1, Y_2, \ldots\) makes up a Markov chain if \[ p_{Y_{n+1} | Y_{1}, \ldots Y_N}(y_{n + 1} | y_1, \ldots, y_n) = p_{Y_{n+1} \mid Y_n}(y_{n+1} \mid y_n) \]

We can illustrate with a simple example of three Markov chains, all of which have a stationary distribution of \(\textrm{bernoulli}(0.5)\) and thus the long-term average of all chains should approach 0.5. We will introduce a parameter \(\theta \in (0, 1)\) and take the probabilities of element \(Y_{n+1}\) depending on the previous element \(Y_{n}\) to be \[\begin{align} \Pr[Y_{n + 1} &= 1 \mid Y_n = 1] = \theta \\ \Pr[Y_{n + 1} &= 1 \mid Y_n = 0] = 1 - \theta \end{align}\] The first line says that if the last number we generated is 1, the probability of the next element being 1 is \(\theta\). The second line says that if the last number we generated is 0, the probability of the next element being 1 is \(1 - \theta\), and thus the probability of the next element being 0 is \(\theta\). That is, there’s a probability of \(\theta\) of generating the same element as the last element.

Here is a Stan program that generates the first \(M\) entries of a Markov chain over outputs 0 and 1, with probability \(\theta\) of generating the same output again.

stan/markov-autocorelation.stan
data {
  int<lower=0> M;
  real<lower=0, upper=1> rho;  // prob of staying in same state
}
generated quantities {
  array[M] int<lower=0, upper=1> y;  // Markov chain
  y[1] = bernoulli_rng(0.5);
  for (m in 2:M) {
    y[m] = bernoulli_rng(y[m - 1] ? rho : 1 - rho);
  }    
}

The assignment to y[m] in this program is equivalent to this longer form.

    if (y[m - 1] == 1) {
      y[m] = bernoulli_rng(rho);
    } else {
      y[m] = bernoulli_rng(1 - rho);
    }

The more concise form exploits two features of Stan. First, boolean expressions are coded as 1 (true) or 0 (false) in Stan, like in C++ and Python. Because y[m - 1] is constrained to take on values 0 or 1, we know that y[m - 1] == 1 is equivalent to y[m - 1]. Second, it uses the ternary operator, also like in C++. The expression cond ? e1 : e2 involves three arguments, separated by a question mark (?) and a colon (:); it’s value is the value of e1 if cond is true and the value of e2 otherwise. Unlike an ordinary function, the ternary operator only evaluates e1 if the condition is true and only evaluates e2 if the condition is false.

We can simulate these models in Python and print the first 100 values simulated for the Markov chain \(y\).

model = csp.CmdStanModel(
               stan_file = '../stan/markov-autocorrelation.stan')
M = 100 if DRAFT else 1000
rhos = []
iterations = []
draws = []
estimates = []
for rho in [0.05, 0.5, 0.95]:
    data = {'M': M, 'rho': rho}
    sample = model.sample(data = data, seed = 123, chains = 1,
                      iter_warmup = 0, iter_sampling = 1,
                      show_progress = False, show_console = False)
    y_sim = sample.stan_variable('y')
    cum_sum = np.cumsum(y_sim)
    its = np.arange(1, M + 1)
    ests = cum_sum / its
    draws.extend(y_sim[0])
    iterations.extend(its)
    estimates.extend(ests)
    rhos.extend(itertools.repeat(str(rho), M))
df = pd.DataFrame({'draw': draws, 'iteration': iterations,
                   'estimate': estimates, 'rho': rhos})
rho05 = np.array(df.query('rho == "0.05"').head(100)['draw'], dtype = 'int')
rho50 = np.array(df.query('rho == "0.5"').head(100)['draw'], dtype = 'int')
rho95 = np.array(df.query('rho == "0.95"').head(100)['draw'], dtype = 'int')
print("Markov chain draw with probability rho of repeating last value:\n")
print("rho = 0.05:", rho05, "\n")
print("rho = 0.50:", rho50, "\n")
print("rho = 0.95:", rho95, "\n")
Markov chain draw with probability rho of repeating last value:

rho = 0.05: [0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1] 

rho = 0.50: [0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0
 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0
 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 0 1 0 0] 

rho = 0.95: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0] 

With a 0.05 probability of staying in the same state, the Markov chain exhibits strong anti-correlation in its draws, which tend to bounce back and forth between 0 and 1 almost every iteration. In contrast, the 0.95 probability of staying in the same state means the draws have long sequences of 0s and 1s. The 0.5 probability produces independent draws from the Markov chain and we see short runs of 0s and 1s.

Next, we will show a running average of the 0 and 1 draws for 1000 iterations for the three chains.

mydraw(
    pn.ggplot(df, pn.aes(x='iteration', y='estimate',
                  group='rho', color='rho'))
    + pn.geom_hline(yintercept = 0.5, color = 'black')        
    + pn.geom_line()
    + pn.labs(x = "iteration", y = "estimate")
)

The black horizontal line at 0.5 shows the true answer. It is clear that the anti-correlated chain (red) converges much more quickly to the true answer of 0.5 and more stably than the independent chain (green), which in turn converges much more quickly than the correlated chain (blue).

4 Hints for Python programmers

Python follows general programming language idioms like in C++ and Java, whereas Stan follows linear algebra idioms like in R and MATLAB.

4.1 Indexing from 1

Unlike Python, Stan uses the standard mathematical indexing for matrices, which is from 1. If I declare a vector[3], then the valid indexes are 1, 2, and 3. If v is a vector variable, then v[0] is an indexing error and will throw an exception and log a warning as it is caught and the resulting MCMC proposal is rejected.

4.2 Storage order

Unlike Python, Stan uses a column-major memory layout for matrices. This means it’s more efficient to iterate by column than by row through a matrix. To visit every element of a matrix, you want to loop as follows.

matrix[M, N] x;  // M rows, N columns
for (n in 1:N) {
  for (m in 1:M) {
    ... operate on x[m, n] ...
  }
}

Arrays in Stan are stored in row-major order, but they are not stored in a single array like in NumPy’s ndarray type. Arrays of primitives store the primitive values in memory-local order. Arrays of containers store pointers to the containers. Thus accessing array members can be non-local in memory, but will not require copying. In contrast, accessing a row or column of a matrix usually requires allocating memory and copying.

4.3 Inclusive ranges

Unlike Python, Stan uses inclusive ranges, so that 1:3 represents the sequence 1, 2, 3. The main disadvantage of inclusive notation is that the length of L:U is U - L + 1.

Putting these together, Python loops over a container with N elements look as follows.

v = np.random.rand(N)
for n in range(0, N):
    ... process v[n] ...

The Python version visits elements v[0], v[1], …, v[N - 1].

Loops in Stan are as follows.

vector[N] v;
for (n in 1:N) {
   ... process v[n] ...
}

The Stan version visits elements v[1], v[2], …, v[N].

4.4 Strong typing

Stan variables are strongly typed. This means every variable is assigned a type when it is declared and that type never changes. Only values of expressions of the same type of the variable may be assigned to it. Block variables are also sized and the size never changes after the declaration is executed.

There are three one dimensional (1D) real-valued container types (1D array, column vector, and row vector) and four two dimensional (2D) containers (2D array, 1D array of vectors, 1D array of column vectors, and matrix). There are functions to convert back and forth for common cases, but complicated cases must be handled manually. Loops that just do reassignment are fast in Stan as there is no overhead from derivative calculations; see the section on automatic differentiation for more details.

4.5 Block scope

Stan follows the C/C++ convention whereby variables declared within a block are local to that block. For example, in this Stan program, logit_theta is only a valid variable within the conditional block; once control has left the conditional block, logit_theta falls out of scope.

if (theta < 0.5) {
  real logit_theta = logit(theta);
  ...
}
logit_theta = 1.9; // ILLEGAL: OUT OF SCOPE

The solution is to move up the declaration.

real logit_theta;
if (theta < 0.5) {
  logit_theta = logit(theta);
  ...
}
logit_theta = 1.9;  // OK

4.6 Loops are not slow

Unlike in R and Python, loops are fast in Stan because Stan is a compiled language. For operations that only involve indexing and assignment, loops can be faster in Stan than their vectorized counterparts, because they avoid intermediate allocations.

However, this is only half of the story. Whenever functions are applied to parameters, the operation, its result, and its operands are recorded in an expression graph. For operations that involve functions other than indexing or reshaping operations applied to parameters, vectorized versions of functions will almost always reduce peak memory usage and increase speed.

4.7 Whitespace and semicolons

Stan follows C/C++ conventions on whitespace in which any sequence of whitespace characters (space, tab, new line) are interchangeable semantically. Python and R are both space-sensitive in different ways—Python uses space as a block delimiter and R has an eager line-based parser. In Stan, we use semicolons (;) to mark the end of an atomic statement. This means it is OK to continue expressions on the following line without any special syntactic marker, e.g.,

lp[k] = bernoulli_lpmf(k | theta)
        + normal_lpdf(y[n] | mu[k], sigma[k]);

Both R and Python here would try to terminate the assignment to lp after the first line and leave a dangling expression for the second line. We recommend following mathematical conventions and breaking a line before an operator, ideally right before a term in a chain of additions or a factor in a chain of multiplications.

5 Stan example: Laplace’s live birth inverse problem

Given a forward model that generates data from parameters, the inverse problem is that of inferring parameter values from observed data.

5.1 Bayesian solutions to inverse problems

Bayes (1763) introduced the paradigm of statistical inference that has come to be known as Bayesian statistics.

Parametric statistics and variable types

Before introducing Bayes’s paradigm, we will try to settle on some terminology around variables. In Stan, we classify each variable as being a data variable or a parameter.

When talking about Bayesian statistics abstractly (i.e., without regard to a specific model form), we will use the variable \(y\) to represent all of the model’s data, which are values that are known or observed. Data my include

  • constants: sizes, bounds,
  • unmodeled data: covariates (aka features), constant parameters of priors, and
  • modeled data: outcome measurements, individual or group-level covariates in generative models.

Similarly, we us the variable \(\theta\) for all of our model’s parameters, which are values that are not known and must be inferred. Parameters may include

  • process parameters: parameters of the forward scientific process,
  • measurement parameters: parameters of the measurement process,
  • hyperpriors: hyperparameters for prior distributions,
  • missing data: unknown data items, latent but potentially observable values, and
  • predictive quantities: event probability indicators, predictions, forecasts, backcasts.

Constants are things like the sizes of matrices, the number of components in a mixture model, the bounds on a constrained variable, etc. Covariates are typically used as inputs (aka dependent variables) to a regression. For instance if I might want to predict someone’s presidential vote based on their income and state of residence. In this case, the income and state of residence are unmodeled covariates and the vote is modeled data if it is known.

In Stan, the number of parameters may depend on the data, but once the data \(y\) is fixed, the number of parameters is also fixed. This means we can implement some non-parametric models (e.g., Gaussian processes), but have to approximate others (e.g., Dirichlet processes).

The Bayesian process

Gelman et al. (2013) summarize the process of developing a Bayesian model for an applied problem as follows,

  1. _Define a joint probability model over all of the observables and non-observables that is consistent with what we know about the underlying scientific process and underlying measurement process (which together form the data-generating process) and our prior knowledge.

  2. Condition on observed data to infer the posterior distribution of any quantities of interest such as predictions or event probabilities.

  3. Evaluate the model in terms of fit to data and resulting predictions, how sensitive the predictions are to model assumptions, and if necessary, go back to step (1) and refine the model.

Usually, the joint probability model here is defined in terms of a parametric sampling density \(p(y \mid \theta)\) which is a forward model describing how to generate observations \(y\) given parameters \(\theta\), and a prior density \(p(\theta)\) capturing what is known about the unknowns before observing the current data \(y\).

Stan is useful for expressing the model (1), performing posterior inference (2), and evaluating and comparing models (3).

Bayes’s formulation

Bayes (1763) formulated the inverse problem of determining a posterior distribution with density \(p(\theta \mid y)\) from a prior with density \(p(\theta)\) and a sampling distribution with density \(p(y \mid \theta)\). Bayes provided the following derivation, which has come to be known as Bayes’s rule. \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta) \, \textrm{d}\theta} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. & \textrm{[chain rule]} \end{align}\]

The first steps reveal the key densities involved in Bayesian inference for data \(y\) and parameters \(\theta\), \[ \underbrace{p(\theta \mid y)}_{\textrm{posterior}} = \underbrace{p(y \mid \theta)}_{\textrm{likelihood}} \cdot \underbrace{p(\theta)}_{\textrm{prior}} \ / \underbrace{p(y)}_{\textrm{evidence}}. \] When considered as a function of the parameter \(\theta\) for fixed data \(y\), as it is in Bayes’s rule, the density \(p(y \mid \theta)\) is known as the likelihood.

The second step expresses the evidence in terms of the likelihood and prior. \[ p(y) = \int_{\Theta} p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta. \] In applications, this integral is usually intractable in the sense of there being no closed-form analytic solution in terms of elementary functions. For example, even a simple logistic regression presents an intractable integral for the evidence, with \[\begin{align} p(\theta) &= \textrm{normal}(\theta, 1), \\[2pt] p(y \mid \theta, x) &= \prod_{n=1}^N \textrm{bernoulli}(y_n \mid \textrm{logit}^{-1}(x_n \cdot \theta)) \end{align}\]

The good news is that most applications of Bayesian statistics do not require us to solve the integral presented in the denominator of Bayes’s rule, even numerically. One case where we need normalizing constants is to evaluate Bayes factors for model comparison, but like Gelman et al. (2013), we prefer other means of evaluating models like posterior predictive checks or cross-validation, both of which are beyond the scope of this getting-started tutorial.

Rather than computing the evidence, we work up to a proportion (i.e., a multiplicative constant), where we have \[ p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), \] because \(p(y)\) is a constant in the sense that it does not depend on \(\theta\). This is what lets us express the log posterior as the sum of the log likelihood and the log prior in Stan, \[ \log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) + \textrm{const}, \] where the constant is the negative log evidence, \(- \log p(y)\).

5.2 Live birth data

A decade after Bayes published his rule, Pierre Simon Laplace (1774) used Euler’s beta function to derive the conjugate prior for binomially distributed data. A conjugate prior for a sampling distribution is a family of distributions such that if the prior is drawn from that family, the posterior will be, too. In this section, we will work through Laplace’s analysis in what has become the standard textbook introduction to Bayesian inference.

Laplace gathered data on the sexes of babies in live births in Paris between 1745 and 1770.

Live births in Paris between 1745 and 1770.
sex live births
female 105,287
male 110,312

It sure looks like there were more boys than girls born.

5.3 Laplace’s model

With \(y\) male births out of \(N\) total births, Laplace adopted the sampling distribution \[ y \sim \textrm{binomial}(N, \theta), \] which requires a number of trials \(N \in \mathbb{N}\), chance of success \(\theta \in (0, 1)\), and number of successes \(y \in 0{:}N\).

Laplace used the prior \[ \theta \sim \textrm{beta}(1, 1), \] where \[ \textrm{beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1}. \] for \(\alpha, \beta \in (0, \infty)\), and \(\theta \in (0, 1)\). The distribution \(\textrm{beta}(1, 1)\) is uniform over probabilities \(\theta \in (0, 1)\) because the density is proportional to a constant, \[\begin{align} \textrm{beta}(\theta \mid 1, 1) &\propto \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} \\[4pt] &= 1. \end{align}\]

Analytic posterior

As an aside, Laplace’s model here is simple enough to solve for the posterior analytically.
\[\begin{align} p(\theta \mid y, N) &\propto p(y \mid N, \theta) \cdot p(\theta) \\[4pt] &= \textrm{binomial}(y \mid N, \theta) \cdot \textrm{beta}(\theta \mid 1, 1) \\[4pt] &\propto \theta^y \cdot (1 - \theta)^{N - y} \cdot \theta^{1 - 1} \cdot (1 - \theta)^{1 - 1} \\[4pt] &= \theta^{y + 1} \cdot (1 - \theta)^{N - y + 1} \\[4pt] &\propto \textrm{beta}(\theta \mid y + 1, N - y + 1). \end{align}\] Despite the intermediate proportional-to steps, the normalizing constant is unique, so we can conclude that \[ p(\theta \mid y, N) = \textrm{beta}(\theta \mid y + 1, N - y + 1). \] This entire derivation went through without ever worrying about the normalizing constant for the beta distribution or the binomial distribution.

5.4 Stan program for birth data

Unlike the first Stan model we saw, which only generates data, the following Stan program requires data to be provided, specifically the the number of male births (\(y\)) and the total number of births (\(N\)). The model will then allow us to estimate the probability of a male birth (\(\theta\)) as well as the probability that boys are more likely to be born than girls (\(\theta > 0.5\)). The Stan program for Laplace’s model is as follows.

stan/sex-ratio.stan
data {
  int<lower = 0> N;  // number of live births
  int<lower = 0, upper = N> y;  // male births
}
parameters {
  real<lower=0, upper=1> theta;  // chance of boy
}
model {
  theta ~ beta(1, 1);  // prior
  y ~ binomial(N, theta);  // binomial sampling
}
generated quantities {
  int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
}

5.5 Parameter and model blocks

In this Stan program, we see that both the number of total births \(N\) and the number of male births \(y\) are given as data. Then there are two additional blocks we did not see in our earlier program, a parameters block, which is used to declare unknown values (here, just the male birth rate \(\theta\)), and a model block, which is where we define our target log density up to an additive constant. The target log density is typically a Bayesian posterior expressed as a log prior and log sampling distribution. The parameters block declares the type of theta, which is a real value constrained to fall in the interval \([0, 1]\). The model block defines the prior, which we take to be uniform over the possible values for theta. The model block also defines the sampling distribution, which codes the fact that the observed data y was generated from a binomial distribution with N trials and theta probability of a male birth. Finally, we have a generated quantities block that defines a single binary indicator variable, boys_gt_girls. This variable will take the value 1 if the probability of a boy is greater than the probability of a girl.

5.6 Sampling from the posterior

When we run a Stan program, Stan generates a sequence of \(M\) random draws, which are approximately identically distributed according to the posterior, \[ \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid y). \] If we were to take \(M \rightarrow \infty\), the draws will converge to being identically drawn from the posterior. In simple or well behaved models, coupling arguments show that they converge to true draws from the posterior in hundreds or thousands (or sometimes more) draws (Jacob, O’Leary, and Atchadé 2020); they become numerically indistinguishable from true draws well before that time.

Stan’s sampler

Stan uses a Markov chain Monte Carlo (MCMC) algorithm, which can lead to autocorrelation in the random draws from the posterior. That is, the draws are not typically independent, with each draw being correlated (or anti-correlated) with the previous draw. This autocorrelation does not introduce bias into the Monte Carlo estimates.

In order to scale to high dimensional problems, we need to exploit gradients and/or Hessians. Duane et al. (1987) introduced the gradient-based Hamiltonian Monte Carlo (HMC) algorithm; the overview by Neal (2011) discusses theoretical and practical scaling results and Livingstone et al. (2019) provide geometric ergodicity results.

For Stan, Hoffman and Gelman (2014) developed an adaptive form of Hamiltonian Monte Carlo (HMC) known as the no-U-turn sampler (NUTS), and Betancourt (2017a) improved it with an adaptive preconditioner and multinomial sampling over trajectories. NUTS can be hyper-efficient in the sense of generating anti-correlated draws that can lead to more efficient Monte Carlo estimates than independent draws.

Fitting in Stan

We fit Laplace’s model by compiling the model, constructing a dictionary for the data, and then calling the sample method on the compiled model with the dictionary. We call the sample method with 1,000 warmup iterations and 10,000 sampling iterations; we are taking so many draws in order to draw show smooth plots later. The Stan programs considered in this introduction are all quite simple and inexpensive to run for many iterations. We consider how many draws are required for statistical inference in the section on practical guidelines.

model = csp.CmdStanModel(stan_file = '../stan/sex-ratio.stan')
boys = 110312
girls = 105287
data = {'N': boys + girls, 'y': boys}
M = 100 if DRAFT else 10_000
sample = model.sample(data = data, seed = 123,
                      iter_sampling = M, iter_warmup = 1000,
                      show_progress = False, show_console = False)

As before, we proceed by first extracting the draws for the variables theta and boys_gt_girls.

theta_draws = sample.stan_variable('theta')
boys_gt_girls_draws = sample.stan_variable('boys_gt_girls')

We can plot a histogram of approximate draws \(\theta^{(m)} \sim p(\theta \mid N, y)\) from the posterior to give us a sense of the value of \(\theta\) and its uncertainty given our observed data \(y\).

mydraw(
  pn.ggplot(pd.DataFrame({'theta': theta_draws}),
            pn.aes(x = 'theta')) +
  pn.geom_histogram(color='white') +
  pn.labs(x = 'θ') +
  pn.theme(axis_text_y = pn.element_blank(),
           axis_title_y = pn.element_blank(),  
           axis_ticks_major_y = pn.element_blank())
)

All of the draws have a value for \(\theta\) between 0.50 and 0.52. In the next sections, we will see how to use these draws to estimate a single value for \(\theta\) as well as to compute probabilities, such as the probability that \(\theta > 0.5\) or \(\theta > 0.51\).

5.7 Bayesian point estimates

In Bayesian terms, a point estimate for a parameter \(\Theta\) conditioned on some observed data \(Y = y\) is a single value \(\hat{\theta} \in \mathbb{R}^D\) that in some way summarizes the posterior \(p(\theta \mid y)\). Formally, Bayesian estimators are defined in such a way as to minimize a loss function.

Posterior mean estimator

The most common Bayesian point estimate for a parameter is the posterior mean,

\[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid Y = y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta \\[6pt] &= \lim_{M \rightarrow \infty} \, \frac{1}{M} \sum_{m=1}^M \theta^{(m)} \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \theta^{(m)}, \end{align}\] where in the last two lines, each draw is distributed approximately according to the posterior, \[ \theta^{(m)} \sim p(\theta \mid y). \]

We have snuck in conditional expectation notation in the first line of this definition. Expectations are just weighted averages, with weights given by a probability density. Bayesian inference involves expectations over the posterior, the concise notation for which is conditional expectation notation, \[ \mathbb{E}\! \left[ f(\Theta) \mid Y = y \right] \ = \ \int_{\Theta} f(\theta) \cdot p_{\Theta \mid Y}(\theta \mid y) \, \textrm d\theta, \] where \(\Theta\) and \(Y\) are random variables with realizations \(\theta \in \Theta\) and \(y \in Y\) (we write \(y \in Y\) for a random variable \(Y\) if \(y\) is in support of \(Y\)’s density, i.e., \(p_Y(y) > 0\)).

For Laplace’s model, the estimate for the male birth rate \(\theta\) conditioned on the birth data \(y\) is calculated as the sample mean for the extracted draws for theta.

theta_hat = np.mean(theta_draws)
print(f"estimated theta = {theta_hat:.3f}")
estimated theta = 0.512

Posterior median estimator, quantiles, and intervals

A popular alternative Bayesian point estimate is the posterior median, \(\theta^+\). The median is such that for each dimension \(d \in 1{:}D\), \[ \Pr[\Theta_d \leq \theta^+_d] = \frac{1}{2}. \]

The posterior median can be calculated by taking the posterior median of the draws, as follows.

theta_plus = np.median(theta_draws)
print(f"estimated (median) theta = {theta_plus:.3f}")
estimated (median) theta = 0.512

Because our posterior distribution is nearly symmetric with Laplace’s data, the posterior mean and posterior median are very close.

Quantiles

Other posterior quantiles are estimated the same way. For example, if we want the posterior 95% quantile, we just take the empirical 95% point in the sorted chain of draws. For example, here are the 5% quantile and 95% quantile for Laplace’s posterior, calculated with empirical quantiles.

quantile_05 = np.quantile(theta_draws, 0.025)
quantile_95 = np.quantile(theta_draws, 0.975)
print(f"""0.05 quantile = {quantile_05:.3f};
0.95 quantile = {quantile_95:.3f}""")
0.05 quantile = 0.510;
0.95 quantile = 0.514

Posterior intervals

Together, the 5% quantile and 95% quantile give us the bounds of our 90% central probability interval, which is defined as the interval containing 90% of the posterior probability mass such that half of the remaining mass (5%) is below the interval and half (5%) is above.

print(f"central 90% posterior interval for theta")
print(f"    = ({quantile_05:.3f}, {quantile_95:.3f})")
central 90% posterior interval for theta
    = (0.510, 0.514)

Other intervals are computed in the exact same way.

Estimation error and bias

The error of an estimate is its difference from the true value, \[ \textrm{err} = \hat{\theta} - \theta. \] Our estimate \(\hat{\theta}\) is implicitly a function of the data \(y\) and so is \(\textrm{err}\), so we can make this explicit and write \[ \textrm{err}(y) = \hat{\theta}(y) - \theta. \]

The bias of an estimator is defined as its expected error (as averaged over the data distribution for the random variable \(Y\)), \[\begin{align} \textrm{bias} &= \mathbb{E}[\textrm{err}(Y)] \\[6pt] &= \mathbb{E}[\hat{\theta}(Y) - \theta] \\[6pt] &= \int_Y \hat{\theta}(y) - \theta \ \textrm{d}y. \end{align}\]

Properties of estimators

The posterior mean is a popular Bayesian estimator for two reasons. First, it is an unbiased estimator in the sense of having zero bias. Second, it has the minimum expected square error among unbiased estimators, where the squared error of an estimate is defined by \[ \textrm{err}^2(y) = \left(\hat{\theta}(y) - \theta\right)^2. \] Posterior means might not exist if the posterior has very wide tails, like the standard Cauchy distribution, \(\textrm{cauchy}(x \mid 0, 1) \propto \frac{1}{1 + x^2}\).

The posterior median \(\theta^+\) has three pleasant properties. First, it is always well defined, even for densities with poles or very wide tails. Second, it is more robust to outliers than squared error because errors are not amplified by squaring. Third, the posterior median estimator minimizes expected absolute error.

The posterior mode may not exist. For example, there is no posterior mode in standard hierarchical models. There is not even a mode for a simple density with a pole, such as \(\textrm{exponential}(x \mid 1) = \exp(-x)\) for \(x > 0\). The posterior mode is at least consistent in the sense of converging to the true value as the data size grows in cases where it exists.

We will concentrate on posterior means in this quick introduction to Bayes and Stan.

(Markov chain) Monte Carlo error and effective sample size

The Markov chain we use to sample is itself a random variable. Re-running the sampler will produce slightly different results due to Monte Carlo error (the error introduced by using only a finite sample of \(M\) draws).

Stan reports Markov chain Monte Carlo standard error along with its estimates of the mean. The MCMC standard error is for a scalar parameter \(\theta_d\) and is defined as \[ \textrm{mcmc-se} = \frac{\textrm{sd}[\Theta_d \mid Y = y]}{N^{\textrm{eff}}}, \] where the numerator is the standard deviation of the parameter \(\theta_d\) in the posterior and \(N^{\textrm{eff}}\) is the effective sample size. The posterior variance and standard deviation are defined as follows.

\[\begin{align} \textrm{var}\left[\Theta_d \mid Y = y\right] &= \mathbb{E}\left[\left(\Theta_d - \mathbb{E}\left[\Theta_d \mid Y = y\right]\right)^2 \ \big| \ Y = y\right] \\[6pt] \textrm{sd}\left[\Theta_d \mid Y = y\right] &= \sqrt{\textrm{var}[\Theta_d \mid Y = y]}. \end{align}\]

In the usual central limit theorem, the sample size (number of independent draws) appers in place of \(N^{\textrm{eff}}\). The effective sample size for a sample of size \(M\) is defined to be \[ N^{\textrm{eff}} = \frac{M}{\textrm{IAT}}, \] where \(\textrm{IAT}\) is the integrated autocorrelation time. We are not going to define it formally, but it may be thought as the interval between effectively independent draws in our Markov chain. If we have low autocorrelation, \(\textrm{IAT}\) will be close to 1 and if the autocorrelation is higher, it can be much higher. If the \(\textrm{IAT}\) is much higher than 100, it can become difficult to estimate. If the autocorrelation is negative, the \(\textrm{IAT}\) is less than 1 and the effective sample size is larger than the number of draws. Thus \(N^{\textrm{eff}}\) is the number of independent draws that would lead the same error as our correlation draws using a Markov chain.

5.8 Estimating event probabilities

Laplace wasn’t looking for a point estimate for \(\theta\). He wanted to know the probability that \(\theta > \frac{1}{2}\) after observing \(y\) male births in \(N\) trials. In the notation of probability theory, he wanted to estimate an event probability.

A subset of parameters is known as an event. We can convert conditions on parameters into events. For example, the condition \(\theta > \frac{1}{2}\) can be turned into the event \[ A = \left\{ \theta \in \Theta : \theta > \frac{1}{2} \right\}. \] Events are what are assigned probabilities by a measure in probability theory. Given a probability measure, the probability of the event \(A\), that the rate of boy births is higher than girl births, will be well defined. Because we can convert conditions to events, we will be sloppy and treat the conditions as if they were events. This allows us to write \(\Pr\!\left[\Theta > \frac{1}{2} \, \big| \, N, y\right]\) for the probability of the event \(\Theta > \frac{1}{2}\).

Event probabilities via indicators

The indicator function \(\textrm{I}\) maps propositions into the value 1 if they are true and 0 if they are false. For example, \(\textrm{I}(\theta > \frac{1}{2}) = 1\) if the proposition \(\theta > \frac{1}{2}\) is true, i.e., when \(\theta\) is greater than one half.

Event probabilities are defined as posterior conditional expectations of indicator functions for events. \[\begin{align} \Pr[\Theta > 0.5 \mid N, y] &= \mathbb{E}\!\left[\textrm{I}[\Theta > 0.5] \,\big|\, N, y\right] \\[8pt] &= \int_{\Theta} \textrm{I}(\theta > 0.5) \cdot p(\theta \mid N, y) \, \textrm{d}\theta \\[8pt] &\approx \frac{1}{M} \sum_{m=1}^M \textrm{I}(\theta^{(m)} > 0.5), \end{align}\] where we assume \(\theta^{(m)} \sim p(\theta \mid N, y)\) is distributed according to the posterior for \(m \in 1{:}M\). Following physics conventions, we use square brackets for functors (functions that apply to functions); that means we write \(\textrm{I}[\cdot]\) when we apply the indicator function to a random variable and \(\textrm{I}(\cdot)\) when we apply it to a primitive scalar.

Events as indicators in Stan

In Stan, we code the value of the indicator function directly and assign it to a variable in the generated quantities block.

generated quantities {
  int<lower=0, upper=1> boys_gt_girls = theta > 0.5;
}  

Conditional expressions like theta > 0.5 take on the value 1 if they are true and 0 if they are false. In mathematical notation, we would write \(\textrm{I}(\theta > 0.5)\), which takes on value 1 if \(\theta > 0.5\) and 0 otherwise; in Stan, like in C++, we treat > as a binary operator that returns either 0 or 1, so we just write theta > 0.5 in Stan.

The answer to Laplace’s question

The posterior mean of the variable boys_gt_girls is thus our estimate for \(\Pr[\theta > 0.5 \mid N, y]\). It is essentially 1. Printing to 15 decimal places, we see

Pr_boy_gt_girl = np.mean(boys_gt_girls_draws)
print(f"estimated Pr[boy more likely] = {Pr_boy_gt_girl:.15f}")
estimated Pr[boy more likely] = 1.000000000000000

The value of 1 returned as an estimate brings up the important problem of numerical precision. As we can see from the histogram, all of our sampled values for \(\theta\) are greater than \(\frac{1}{2}\).

Laplace calculated the result analytically, which is \[ \Pr\!\left[\Theta > \frac{1}{2} \ \bigg| \ N, y\right] \approx 1 - 10^{-27}. \] Thus we would need an astronomical number of posterior draws before we would generate a value of \(\theta\) less than \(\frac{1}{2}\). As given, the answer of 1.0 is very close to the true answer and well within our expected Monte Carlo error. As an aside, using the standard double-precision (8 byte, 64 bit) floating point representation for real numbers, the number \(1 - 10^{-27}\) will round to \(1\) because machine precision, which is defined as the largest \(\epsilon\) such that \(1 - \epsilon \neq 1\), is about \(10^{-16}\). Let’s try it in Python, which, like Stan, uses double-precision arithmetic by default.

print(f"{(1.0 == (1.0 - 1e-27)) = }")
(1.0 == (1.0 - 1e-27)) = True

which one would expect given \(10^{-27}\) is below the machine precision,

print(f"Machine precision: {np.finfo(float).eps}")
Machine precision: 2.220446049250313e-16

MCMC summary statistics from Stan

With Stan, we can print a summary for the variable \(\theta\) in the posterior, which reports all of these values. We just call The .summary() function on the sample.

sample.summary(sig_figs = 3)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -149000.000 0.004790 6.660000e-01 -149000.00 -149000.000 -149000.000 19300.0 35200.0 1.0
theta 0.512 0.000009 1.090000e-03 0.51 0.512 0.513 15300.0 28000.0 1.0
boys_gt_girls 1.000 NaN 9.380000e-14 1.00 1.000 1.000 NaN NaN NaN

The rows are for our random variables, with lp__ indicating the unnormalized log density defined by the Stan program. The unnormalized log density includes any change-of-variables adjustments required for transforming constrained parameters; the transforms for constrained parameters and associated change-of-variables adjustments are detailed in (Stan Development Team 2023b).

The other two rows are for variables defined in the Stan program, theta, and boys_gt_girls. The number of significant figures used in the results can be controlled in the summary function, as can the quantiles being reported.

The first column reports the posterior mean, and agrees with our earlier calculations for both variables. The second column is the Monte Carlo standard error (based on an estimated effective sample size) and a posterior standard deviation estimate. The next three columns are quantiles we computed earlier, and they also agree with our calculations. Next is the effective sample size, which can vary from variable to variable, and the effective sample size rate (per second). The final column reports \(\widehat{R}\), which we discuss in the next section.

6 Warmup and convergence monitoring

When running Markov chains, we want to make sure that we have moved far enough that our draws are approximately from the posterior. A standard way to monitor convergence is to start multiple Markov chains at different initializations (ideally chosen from a diffuse initialization like a draw from the prior) and measure whether they are producing draws from the same distribution.

6.1 Warmup

During its initial warmup iterations, Stan tries to find the high probability mass region in which it should be sampling, adapt a good step size, and estimate posterior variance. The variance is used to precondition the sampler through scaling; see (Neal 2011) for details. Stan can also estimate a dense covariance matrix and precondition with rotation and scaling (i.e., with a Euclidean metric).

Warmup converges when the step size and posterior covariance estimates no longer change. With multiple chains, it’s possible to test that they have all converged to roughly the same step size and covariance estimate. Unless things are going wrong, we typically don’t bother measuring whether adaptation has converged and instead measure our end goal directly, which is whether we are getting reasonable posterior draws after warmup.

Warmup doesn’t form a single coherent Markov chain because it uses memory to adapt. Once Stan starts sampling, the result is a Markov chain. All of our posterior analysis will be with draws from the Markov chain, not from warmup. We can save and extract the warmup draws to investigate the behavior of warmup.

6.2 Potential scale reduction and \(\widehat{R}\)

Stan uses the potential scale reduction statistic \(\widehat{R}\) (pronounced “R hat”). Given a sequence of Markov chains, Stan splits each of them in half to make sure the first half and second half of the chain agree, then calculates variances within each chain and across all chains and compares. The statistic \(\widehat{R}\) converges to 1 as the Markov chains converge to the same distributions.

6.3 How many chains for how long?

A simple rule of thumb is to run four chains until \(\hat{R} \leq 1.01\) and effective sample size (ESS) is greater than 100. The reason we recommend an effective sample size of “only” 100 is that it means the standard error will be \(\frac{1}{10}\) the size of the standard deviation. Given that posterior standard deviation represents residual uncertainty, calculating means to higher precision is rarely worthwhile.

The easiest way to achieve \(\widehat{R} \leq 1.01\) and \(N^{\textrm{eff}} > 100\) is to start with 100 warmup iterations and 100 sampling iterations. If there are \(\widehat{R}\) values that are too large or there are effective sample size values that are too low, then double the number of sampling and warmup iterations, and try again. Running more warmup iterations is important because sampling will not be efficient if warmup has not converged. At most, using the same number of warmup and sampling iterations costs a factor of two over the optimal settings, which are not known in advance.

Even if we use more than four chains, we need to make sure that our effective sample size is at least 25 per chain. It is not that we need so many draws for inference, but that we do not trust our effective sample size estimator if it is much lower. One way to check that the ESS estimator is OK is to double the number of draws and make sure that the ESS also doubles. If it doesn’t, it’s a sign that the first ESS estimate is unreliable.

6.4 Running chains concurrently

You can set the number of chains that are run using the chains argument of the sample() method and you can set the maximum number of chains to execute concurrently using parallel_cores (which defaults to 1, sequential execution). If you set the maximum number of parallel chains to be too low, CPU resources are potentially unused. If you set the number too high, then either CPU or memory can bottleneck performance and cause it to be slower overall than running with fewer chains. The only advice I can give here is to experiment. In personal projects on our own hardware, the goal is usually the largest effective sample size in the minimum amount of time. On the other hand, I sometimes find I need to leave enough processing power left over to continue to work on documents, email, etc. In a server setting, memory usage, latency, throughput, and I/O need to be balanced even more carefully between throughput and resource usage.

7 Stan example: A/B testing

A common application of statistics is to compare two things, such as the effectiveness of a new drug versus the current drug used to treat a condition. Another application might compare the effectiveness of two different advertisement presentations in getting users to click through. This is usually called A/B testing in a nod to comparing a hypothetical option A and option B.

Let’s consider three good Mexican restaurants in New York City, Downtown Bakery II in the East Village, Taqueria Gramercy in Gramercy, and La Delicias Mexicanas in Spanish Harlem. Here’s the number of reviews and 5-star reviews for these restaurants on the web site Yelp.

name 5-star reviews total reviews
Downtown Bakery II 141 276
Taqueria Gramercy 84 143
La Delicias Mexicanas 41 87

We can estimate a few things. First, we can estimate the probability that each restaurant really is a 5-star restaurant. We will parameterize this directly with a rate of 5-star reviews parameter for each restaurant. Then we can rank the restaurants based on their probability of being a 5-star restaurant. What does it mean to “be a 5-star restaurant” in this sense? It means getting 5-star reviews from Yelp reviewers. Our model is going to treat the reviewers as exchangeable in the sense that we don’t know anything to distinguish them from one another. Now whether this notion of 5-star restaurant is useful will depend on how similar the reader is to the population of reviewers.

We will assume that the number of 5-star reviews for a restaurant \(k \in 1{:}K\) depends on its underlying quality \(\theta_k \in [0, 1]\), \[ n_k \sim \textrm{binomial}(N_k, \theta). \] Here \(N_k \in \mathbb{N}\) is the number of reviews for restaurant \(k\) and \(n_k \in 0{:}N_k\) the number of 5-star reviews. We further assume the probabilities of 5-star reviews have a beta distribution, \[ \theta_k \sim \textrm{beta}(\alpha, \beta). \] In a beta distribution, the sum \(\alpha + \beta\) determines how much to regularize estimates toward \(\alpha / (\alpha + \beta)\), with \(\alpha = \beta = 1\) providing a uniform distribution.

For inference, we will be interested in the posterior distribution \(p(\theta \mid N, n)\) of 5-star review probabilities. This gives us the posterior density of the restaurants’ probability of receiving a 5-star review. With this posterior, we can rank restaurants based on their probability of receiving a 5-star review and calculate the probability that each is the best restaurant, \[ \Pr[\Theta_k = \max(\Theta) \mid n, N]. \]

7.1 Stan model for A/B testing

We will assume there are a total of \(K\) items being compared. In traditional A/B testing, \(K = 2\), but our example uses \(K = 3\) and our code works for any \(K\). We define the indicator array for our event probability estimates in the generated quantities block.

stan/ab-test.stan
data {
  int<lower=0> K;
  array[K] int<lower=0> trials;
  array[K] int<lower=0> successes;
  real<lower=0> alpha;  
  real<lower=0> beta;  
}
parameters {
  array[K] real<lower=0, upper=1> theta;
}
model {
  successes ~ binomial(trials, theta);
  theta ~ beta(alpha, beta);
}
generated quantities {
  array[K] int<lower=0, upper=1> is_best;
  for (k in 1:K) {
    is_best[k] = theta[k] == max(theta);
  }
}

We have coded the data for \(K\) items directly in terms of number of trials and number of successes. We have also supplied the prior for the probabilities of success as data as alpha and beta. The model is the same binomial as we had before, except now the likelihood and priors are vectorized. In general, Stan is able to take something like the binomial distribution, which has an integer number of trials, integer number of successes, and a scalar success probability and take vectors for all of these. What we have written above is identical to what we would get with a loop,

  for (k in 1:K) {
    successes[k] ~ binomial(trials[k], theta[k])
  }

The vectorization of theta is different in that only theta is an array, whereas alpha and beta are scalars. The sampling statement for theta is equivalent to

  for (k in 1:K) {
    theta[k] ~ beta(alpha, beta);
  }

Because alpha and beta are scalars, they are not indexed. Rather, they are broadcast, meaning that the same alpha and beta is reused for each dimension of theta.

So now let’s call and fit this model and print the summary. We are setting \(\alpha = \beta = 2\), which is equivalent to setting them equal to 1 and adding 2 trials and 1 success to the data for each restaurant.

M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/ab-test.stan')
data = {'K': 3, 'trials': [276, 143, 87],
        'successes': [141, 84, 41],
        'alpha': 2, 'beta': 2}
sample = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -360.000 0.02600 1.200 -360.00 -360.00 -360.00 2100.0 26000.0 1.0
theta[1] 0.510 0.00048 0.030 0.46 0.51 0.56 3813.0 46501.0 1.0
theta[2] 0.590 0.00062 0.040 0.52 0.59 0.65 4184.0 51030.0 1.0
theta[3] 0.470 0.00071 0.051 0.39 0.47 0.56 5185.0 63230.0 1.0
is_best[1] 0.065 0.00440 0.250 0.00 0.00 1.00 3074.0 37489.0 1.0
is_best[2] 0.900 0.00540 0.300 0.00 1.00 1.00 3035.0 37015.0 1.0
is_best[3] 0.034 0.00310 0.180 0.00 0.00 0.00 3393.0 41377.0 1.0

First, we make sure that our \(\widehat{R}\) values are less than 1.01 and that our \(N^{\textrm eff}\) are all greater than 25 per chain (4 chains here with default settings). In fact, we can see that our sampling is nearly as efficient as if we had independent posterior draws. The probability of a 5-star review is our value for theta, which are 51% for Downtown Bakery, 59% for Taqueria Gramercy, and 47% for La Delicias. Even so, we see that the probability that Taqueria Gramercy is the most likely to have their next review be a 5-star review, it’s still only 90% that’s the case. This is because binomial data is weak and we only have observations in the hundreds. (Editor: Don’t listen to Yelp—La Delicias is the best of these restaurants by far.)

8 Stan’s execution model

Stan programs consist of several blocks. Here is a summary of the blocks, when they are executed, and what they do. None of the blocks are required, but the blocks that do appear must be in the order listed here.

block executed behavior
functions as needed user-defined function definitions
data once read data to construct model
transformed data once define transformed data
parameters once / log density constrain parameters (w. Jacobian)
transformed parameters once / log density define transformed parameters
model once / log density evaluate model log density
generated quantities once / draw define generated quantities

8.1 Data and transformed data

The data block contains only variable declarations. The variables declared in the data block are read once at data load time.

The transformed data block contains variable declarations and definitions. It is executed once, after the data is read is in, to define the transformed data variables. It defines functions of the data, such as standardized predictors, constants to use for priors, etc. The transformed data block may use pseudorandom number generation.

All variable declarations at the block level must declare types and sizes (which may be data dependent). Local variables within blocks are declared without sizes.

Constraints in the data block are evaluated at the end of the block’s execution. Any violations throw exceptions which terminate execution.

Transformed data variables may be assigned in the transformed data block, but after these blocks have executed, the variables may not be reassigned.

8.2 Parameters and transformed parameters

The parameters block declare the variables over which the model is defined. It contains only variable declarations with sizes. When executed, it is supplied with concrete parameter values, which it transforms to the unconstrained scale based on the declared constraints (if any).

Constraints on parameters are used to define a transform of the constrained space to \(\mathbb{R}^D\). For example, a variable declared with a lower=0 constraint is log-transformed to an unconstrained variable, whereas a variable declared as a covariance matrix is Cholesky factored and then a log transform is applied to the diagonal elements to render an \(N \times N\) matrix as a vector in \(\mathbb{R}^{\binom{N}{2}}\). It is critical that parameters are declared with all necessary constraints so that the model has support (i.e., non-zero density, finite log density) over the entire transformed space.

The transformed parameters block defines functions of parameters and data. Any constraints declared on transformed parameters are validated at the end of the block’s execution. If the constraints are violated, an exception will be thrown, which typically causes the current proposal to be rejected.

Variables declared in the parameters block are like function argument declarations. The log density function defined by a Stan program takes the parameters as an argument. Thus the parameter values are always supplied externally to a Stan program.

After the transformed parameters block has executed, variables declared in it may not be reassigned.

The only difference between variables declared as local variables in the model block and those declared in the transformed parameters block is that the transformed parameters are printed and also available in the generated quantities block.

8.3 Model

The model block defines the log density and is evaluated once per log density evaluation. The log density starts with any components added by the log Jacobian determinants of the inverse transforms of the parameters from the unconstrained to the constrained space. The model is defined using constants or modeled data from the data or transformed data block, and parameters or transformed parameters.

The target log density that will be returned may be incremented directly, e.g.,

target += -0.5 * x^2;

The current value of the target is available as target(); printing it can be useful for debugging.

Sampling statements are just syntactic sugar for incrementing the target log density. For example, the sampling statement

x ~ normal(0, 1);

is equivalent to the target increment statement

target += normal_lupdf(x | 0, 1);

The vertical bar is used to separate outcome variates from the parameters. The notation lpdf denotes a log probability density function, with lpmf for log probability mass functions. The variants lupdf and lupmf are for their unnormalized counterparts, which may drop normalizing constants that do not depend on the parameters. Unless the normalizing constants are needed, for example in a mixture model component, it is more efficient to use the lupdf and lupmf forms either explicitly or implicitly through sampling statements.

8.4 Generated quantities

The generated quantities block is evaluated once per draw rather than once per log density evaluation. With algorithms like Hamiltonian Monte Carlo, each draw may require a few, dozens, or even hundreds of log density evaluations. The further advantage of generated quantities is that they are evaluated with double-precision floating point values because they do not require differentiation, which is a factor of four or more faster than autodiff. The generated quantities block may also use pseudo-random number generation. Any constraints are evaluated at the end, but exceptions do not cause rejections, just potential warnings and potentially undefined (not-a-number) values.

Generated quantities do not contribute to the definition of the log density being sampled, but they are nevertheless part of the statistical model. The typical role of generated quantities is for posterior predictive inference, which is conditionally independent of the observed data given the model parameters.

8.5 User-defined functions

Users can define functions in the functions block. These can be applied anywhere in a Stan program, just like built-in functions. In addition to ordinary mathematical functions, users can define several types of special-purpose functions with Stan-specific behavior.

User-defined probability density and mass functions

Defining a function with the suffix _lpdf defines a continuous log probability density function where the first variable is a real- or complex-valued variate (these may be containers like matrices).

Similarly, the suffix _lpmf marks a discrete log probability density function and Stan will validate that the first argument is declared as an integer.

These functions are called using standard vertical bar notation to separate the variate and its parameters and they may also be used in sampling statements. For example, consider the following custom log probability density function.

functions {
  real foo_lpdf(real y, real theta) {
    return normal_lpdf(y | 0, exp(-theta))
  }
}

With this definition, the model block may use the function to increment the target log density,

  target += foo_lpdf(z | phi);

User-defined functions ending in _lpdf may also be used directly in sampling statements after removing the suffix.

  z ~ foo(phi);

A user-defined _lpdf function implicitly defines an equivalent _lupdf function so that user-defined _lpdf functions may be used with sampling notation. User-defined functions ending in _lupdf or _lupmf are not allowed.

User-defined random number generators

Only functions defined with _rng suffixes will be able to call other functions with _rng suffixes and they will only be able to be used in Stan programs in the transformed data and generated quantities blocks (i.e., the blocks that are not part of the model definition and hence do not need to be differentiated).

For example, a simple way to generate chi-squared random variates is to literally sum a sequence of squared normal variates (in a real program, one would use the the built-in chi-square _rng function).

real simple_chi_sq_rng(int n) {
  real y = 0;
  for (i in 1:n) {
    y += normal_rng(0, 1)^2;
  }    
  return y;
}

Modifying the target

Functions that use the suffix _lp can access and modify the log density. This means they can only be used in transformed parameters or model blocks. For example, the following function implements what Stan does implicitly for a variable declared with a lower=0 constraint.

real pos_constrain_lp(real v) {
  target += v;  // change of variables adjust: log abs(d/dv exp(v))
  return exp(v);  // change of variables
}

8.6 Automatic differentiation

Stan uses automatic differentiation to define gradients of the log density function. This is done by building up the complete expression graph of the log density being defined by the Stan model. A concrete value for unconstrained parameters is supplied, these are constrained and the log absolute determinant of the Jacobian of the unconstraining transformed is added to the expression graph. Then each operation involving parameters is added to the expression graph. The root is the final log density value, which is the value of target. The leaves are the unconstrained parameters. A reverse pass over this expression graph propagates the partial derivatives from the log density value down to the unconstrained parameters using the chain rule (it applies in the reverse order of the expression construction, which is a topological sort of the directed graph). See Carpenter et al. (2015) for details of Stan’s C++ architecture for automatic differentiation; current details are only in the developer documentation and code repository.

9 Stan example: Regression and prediction

In this section, we will go over simple regression models in Stan and see how to generate predictions for new items based on fitted parameters.

9.1 Fisher’s iris data set

We will analyze Fisher’s classic iris data set, which provides sepal and petal length and width (in centimeters) as well as the species of iris. First we read it in, then we will plot petal width versus petal length, with species indicated by color.

df = pd.read_csv('../stan/iris-data.csv')
mydraw(
  pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
                       color='species')) +
  pn.geom_point() +
  pn.labs(x = "petal width (cm)", y = "petal length (cm)")
)

The three species are of very different sizes, but there is a roughly linear relation between petal width and petal length across the different species.

The width values all have exactly one decimal place of accuracy, which means they were recorded to the nearest millimeter. There are models that deal with this kind of measurement quantization, but we will not consider them now and will instead proceed as if there were no rounding of our measurements.

9.2 Linear regression

A simple univariate linear regression models one measurement as a linear function of another measurement. In this example, we will model an iris flower’s petal length as a linear function of its petal width. If we let \(x_n\) be the petal width for flower \(n\) and let \(y_n\) be the petal length. A linear regression says that we expect \(y_n\) to be a linear function of \(x_n\), or roughly, the expected value of \(y_n\) will be \(\alpha + \beta \cdot x_n\) for some intercept \(\alpha\) and slope \(\beta\).

The linear relation in a linear regression only holds in expectation. That is, we expect the value of \(y_n\) to be \(\alpha + \beta \cdot x_n\), but in real observations there will be error due to either measurement or a failure of the linear relationship. In the Iris data plot, you can see that there are multiple irises with exactly the same measured petal width but varying petal lengths.

We can introduce a variable \(\epsilon_n\) for the difference between a petal’s length and its expected length given the linear relationship, \[ y_n = \alpha + \beta \cdot x_n + \epsilon_n. \] It is traditional to assume the error is normally distributed with a zero mean and scale \(\sigma > 0\), \[ \epsilon_n \sim \textrm{normal}(0, \sigma). \] We can rearrange terms and write this in a fashion that will mirror how it’s coded in Stan, \[ y_n \sim \textrm{normal}(\alpha + \beta \cdot x_n, \sigma). \] In this form, it’s clear that the linear prediction is the expectation for the random variable \(Y_n\) and the standard deviation is the scale, \[ \mathbb{E}[Y_n] = \alpha + \beta \cdot X_n \qquad \mathrm{sd}[Y_n] = \sigma. \]

Here is a simple Stan model to regress petal length on petal width; we replace \(x\) with petal_width and \(y\) with petal_length.

../stan/iris-petals.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ normal(alpha + beta * petal_width, sigma);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ lognormal(0, 1);
}

The data block says there are \(N\) observations of petal width and length. The widths and lengths are declared to have type (column) vector, with a constraint lower=0 and size (number of rows) N. We are declaring these types as vectors because we are going to apply vector arithmetic to them.

The model parameters are coded in the parameter block. The lower bound for sigma is required here as the normal distribution is only defined for positive values of sigma. Stan requires every parameter value that satisfies the declared constraints to be in the support of the density, which is the set of values for which the density is non-zero (equivalently, the log density is finite).

The model codes the regression following the math, but without the subscripts. What’s going on is that the sampling statement for petal_length is vectorized. The variables petal_length and petal_length are both vectors of size N, whereas alpha, beta, and sigma are all scalars. Working outward, beta * petal_width is a scalar times a vector, which is defined in the usual way as a vector whose nth entry is beta * petal_width[n]. We then add alpha to that result (multiplication binds more tightly than addition and as in mathematical writing, we drop all unnecessary parentheses). This is defined by broadcasting so that alpha acts like a vector of size N and addition is defined in the usual way for vectors. As a result, alpha + beta * petal_width is a vector of size N, whose nth element is alpha + beta * petal_width[n]. Now we have a vector for the location parameter of a normal distribution, a scalar scale (sigma), and a vector outcome (petal_length). This works by broadcasting sigma to use for each n. The end result is equivalent to the following, but much more compact and efficient.

for (n in 1:N) {
  petal_length[n] ~ normal(alpha + beta * petal_width[n], sigma);
}

The rest of the model is priors for the regression coefficients and error scale. The coefficients are unconstrained, but the error scale is constrained to be positive, so we use a lognormal distribution (we could have also used a half normal and we should if scales near zero are possible).

We can then compile and fit the model and display the resulting fit for alpha and beta as a scatterplot.

def iris_data_frame():
    return pd.read_csv('../stan/iris-data.csv')

def iris_data():
    df = iris_data_frame()
    N = df.shape[0]
    petal_width = df['petal_width']
    petal_length = df['petal_length']
    species = 1 + pd.Series(df['species']).astype('category').cat.codes
    num_species = 3
    data = {'N': N,
            'K': num_species,
            'species': species,
            'petal_width': petal_width,
            'petal_length': petal_length,}
    return data

M = 100 if DRAFT else 1000
model = csp.CmdStanModel(stan_file = '../stan/iris-petals.stan')
sample = model.sample(data = iris_data(), seed = 123,
                      iter_warmup = M, iter_sampling = M,       
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ 34.00 0.0350 1.300 31.00 34.00 35.00 1400.0 7400.0 1.0
alpha 1.10 0.0020 0.077 0.96 1.10 1.20 1500.0 7800.0 1.0
beta 2.20 0.0014 0.054 2.10 2.20 2.30 1400.0 7200.0 1.0
sigma 0.48 0.0006 0.029 0.44 0.48 0.53 2300.0 12000.0 1.0

This summary doesn’t give us a good feeling for the uncertainty in the linear relationship. To do that, we will plot multiple draws from the posterior along with the data.

alpha_draws = sample.stan_variable('alpha')
beta_draws = sample.stan_variable('beta')
plot =  pn.ggplot(df, pn.aes(x='petal_width', y='petal_length',
                             color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (cm)", y = "petal length (cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
    plot = plot + pn.geom_abline(intercept = a, slope = b,
                                 alpha = 0.5, size = 0.2)
mydraw(
  plot
)

In order to fit all three groups of data, the plot doesn’t do such a great job of fitting any of them—the larger instances among iris setosa examples and the larger instances among the irs versicolor have underestimated petal lengths.

We didn’t include the error scale, but it’s roughly 0.5. That means it won’t be uncommon to get errors greater than 1 or less than -1. For iris setosa, this could easily result in predictions with negative lengths!

9.3 Built-in generalized linear model functions

Stan supplies built-ins for the most common generalized linear model (GLM) functions. These allow linear and non-linear regressions to be coded easily and efficiently. For example, our linear regression could be coded with the normal_id_glm function and a logistic regression coded with the bernoulli_logit_glm. The linear regression for the Iris data could be written as follows

  petal_length ~ normal_id_glm(petal_width, alpha, beta, sigma);

if the data and parameter type declarations are generalized to

  matrix<lower=0>[N, 1] petal_width;
  vector[N] petal_length;
  real alpha;
  vector[1] beta;
  real<lower=0> sigma;

The generalization of a vector of predictors to a matrix allows multivariate regressions with multiple predictors per item (e.g., using population density, speed limit, traffic density, and time of day to predict pedestrian traffic accidents).

9.4 Robust regression

Although we won’t consider this model in detail, Stan’s plug-and-play design makes it very simple to convert our linear regression into a robust regression by swapping out the normal error model for a Student-t error model,

  petal_length ~ student_t(dof, alpha + beta * petal_width, sigma);

We have just used dof for the degrees of freedom variable, but setting it at a value like 4 provides wider tails and setting it at 1 produces the very wide-tailed Cauchy distribution, which does not even have a finite mean or variance.

9.5 Posterior predictive inference

Now let’s say we have a new observation where we know the petal width and want to predict its length. Mathematically, to make a prediction for a new item, we use posterior predictive inference.

First, let’s consider evaluating the log density of a new petal’s length (\(\tilde{y}\)) given its observed width (\(\tilde{x}\)), having observed our original data \(x\) and \(y\). In Bayesian inference, we evaluate the posterior predictive distribution, where we let \(\Theta\) be our random parameters with realization \(\theta\), \[\begin{align} p(\tilde{y} \mid \tilde{x}, x, y) &= \mathbb{E}[p(\tilde{y} \mid \tilde{x}, \theta) \mid x, y] \\[6pt] &= \int_{\Theta} p(\tilde{y} \mid \tilde{x}, \theta) \cdot p(\theta \mid x, y) \, \textrm{d}\theta \\[6pt] &\approx \frac{1}{M} \sum_{m=1}^M \, p(\tilde{y} \mid \tilde{x}, \theta^{(m)}), \end{align}\] where \(\theta^{(1)}, \ldots, \theta^{(m)} \sim p(\theta \mid x, y)\) are draws from the posterior. The parameters are marginalized out. It’s perhaps easiest to understand by considering the Monte Carlo approximation, which just averages the sampling log density over draws of parameters from the posterior.

Sampling uncertainty and estimation uncertainty

It can help to break the integral down into the two components of uncertainty, sampling uncertainty due to our sampling distribution and posterior uncertainty in the values of our parameters, \[ \int_{\Theta} \underbrace{p(\tilde{y} \mid \tilde{x}, \theta)}_{\textrm{sampling uncertainty}} \cdot \underbrace{p(\theta \mid x, y)}_{\textrm{estimation uncertainty}} \, \textrm{d}\theta \] In our case, \(\theta = \begin{bmatrix}\alpha & \beta & \sigma\end{bmatrix}\) and the sampling distribution is \[ p(\tilde{y} \mid \tilde{x}, \alpha, \beta, \sigma) = \textrm{normal}(\tilde{y} \mid \alpha + \beta \cdot \tilde{x}, \sigma). \] Even if we know the parameter values \(\alpha\), \(\beta\), and \(\sigma\) exactly, we can only predict \(\tilde{y}\) to within \(\epsilon\), where \(\epsilon \sim \textrm{normal}(0, \sigma)\). If we plug in a point estimate \(\widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}\) for our parameters, we might get approximate inference that takes into account sampling uncertainty, but not estimation uncertainty, e.g., \[ p(\tilde{y} \mid \tilde{x}, x, y) \approx p(\tilde{y} \mid \tilde{x}, \widehat{\alpha}, \widehat{\beta}, \widehat{\sigma}). \]

So far, this only gives us a way to evaluate the log density of a resulting outcome \(\tilde{y}\) given a predictor \(\tilde{x}\). If we want to simulate possible \(\tilde{y}\), we have to make sure to add sampling uncertainty and draw \[ \tilde{y}^{(m)} \sim \textrm{normal}\!\left(\alpha^{(m)} + \beta^{(m)} \cdot \tilde{x} \mid \sigma^{(m)}\right), \]

where \(\alpha^{(m)}, \beta^{(m)}, \sigma^{(m)} \sim p(\alpha, \beta, \sigma \mid x, y)\) are posterior draws. In general, if we then want to estimate \(\tilde{y}\), we can take posterior means of these values. In the case here, our sampling distribution is symmetric, so that the expectation of the random draws and the expectation of their mean is identical. If our only goal is to estimate an expectation, it is better to average expectations than to average random draws with those expectations. This is formalized in the Rao-Blackwell theorem. Converting an estimator that uses a sample to one that uses expectations of that sample is called “Rao-Blackwellization.”

Standalone generated quantities

Let’s put all this together into a Stan program. First, let’s make posterior predictions for some new \(\tilde{y}\). We can do this using Stan’s standalone generated quantities feature that lets us run the generated quantities block of a new model given draws from another model. We could’ve also just included the generated quantities block in the original program.

iris-predict.stan
data {
  int<lower=0> N_tilde;
  vector<lower=0>[N_tilde] petal_width_tilde;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
generated quantities {
  vector<lower=0>[N_tilde] E_petal_length_tilde
    = alpha + beta * petal_width_tilde;
  vector<lower=0>[N_tilde] petal_length_tilde
    = to_vector(normal_rng(E_petal_length_tilde, sigma));
}

This program declares two new data variables, N_tilde for the number of predicted items, and petal_width_tilde, a vector of petal widths of size N_tilde.

It is important that the program used for standalone generated quantities use exactly the same parameters as the original program. After our original fit, we read that sample back in for the parameters, then run generated quantities (and the transformed parameter block if there is one).

The generated quantities block first calculates the expected petal length given the petal width and assigns it to the variable E_petal_length_tilde. This code is vectorized so that it calculates all of the input petal widths at once. The second variable is then set by sampling according to the sampling distribution using the normal_rng function. This function is also vectorized, but it returns an array, so we convert it to a vector just to keep all the types the same.

data = {'N_tilde': 3,
        'petal_width_tilde': [0.4, 1.75, 3.8]}
model = csp.CmdStanModel(
     stan_file = '../stan/iris-posterior-predictive-sim.stan')
pps_sample = model.generate_quantities(data = data, seed = 123,
                                       previous_fit = sample,
                                       show_console = False)
print('Estimated petal lengths given petal widths')
for i in range(3):
  length_draws = \
      pps_sample.stan_variable('petal_length_tilde')[0:100, i]
  E_length_draws = \
      pps_sample.stan_variable('E_petal_length_tilde')[0:100, i]
  print(f"petal_width_tilde[{i}] = {data['petal_width_tilde'][i]}")
  print(f"  mean(E_petal_length_tilde[{i}]) = {np.mean(E_length_draws):.2f}")
  print(f"  sd(E_petal_length_tilde[{i}]) = {np.std(E_length_draws):.2f}")
  print(f"  mean(petal_length_tilde[{i}]) = {np.mean(length_draws):.2f}")
  print(f"  sd(petal_length[{i}]) = {np.std(length_draws):.2f}\n")
Estimated petal lengths given petal widths
petal_width_tilde[0] = 0.4
  mean(E_petal_length_tilde[0]) = 1.98
  sd(E_petal_length_tilde[0]) = 0.07
  mean(petal_length_tilde[0]) = 1.88
  sd(petal_length[0]) = 0.45

petal_width_tilde[1] = 1.75
  mean(E_petal_length_tilde[1]) = 4.98
  sd(E_petal_length_tilde[1]) = 0.05
  mean(petal_length_tilde[1]) = 4.95
  sd(petal_length[1]) = 0.52

petal_width_tilde[2] = 3.8
  mean(E_petal_length_tilde[2]) = 9.54
  sd(E_petal_length_tilde[2]) = 0.16
  mean(petal_length_tilde[2]) = 9.51
  sd(petal_length[2]) = 0.48

For each of our input petal widths \([0.4 \quad 1.75 \quad 3.8]\), we see the posterior mean prediction for petal width and its standard deviation calculated two ways. First, we take the posterior draws for the expected petal length, E_petal_length_tilde, which is just the linear prediction of petal value. The uncertainty comes only from estimation uncertainty in \(\alpha\) and \(\beta\). Second, we take the posterior draws for petal length, which include an additional normal error term with scale \(\sigma\). The means are roughly the same either way, but the standard deviation is much higher in the case of sampling. The second variable, petal_length_tilde, includes both sources of posterior uncertainty, estimation uncertainty and uncertainty from the sampling distribution.

9.6 Lognormal and transformed data

In this section, we will convert our iris model to the log scale using the transformed data block to convert the petal width to a log scale and use a lognormal regression on the width to predict the length.

The lognormal distribution is a simple transform of a normal distribution with support over positive values. If \[ U \sim \textrm{lognormal}(\mu, \sigma), \] then \[ log U \sim \textrm{normal}(\mu, \sigma). \] This will allow us to transform our Stan program. We first translate the petal width to the log scale in the transformed data block, then model length as a lognormal regression on log width in the model block.

../stan/iris-petals-log.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
}
transformed data {
  vector[N] log_petal_width = log(petal_width);
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ lognormal(alpha + beta * log_petal_width, sigma);
  alpha ~ normal(0, 3);
  beta ~ normal(0, 3);
  sigma ~ lognormal(0, 0.6);
}

We have introduced a new block for transformed data, in which we have defined a vector of log petal widths to use as a predictor. The petal lengths are already positive and we do not modify those. The effect is that we are regressing log petal length on log petal width and the sampling statement could have been replaced with

  log(petal_length) ~ normal(alpha + beta * log_petal_width, sigma);

The result would be the same up to a constant normalizing term involving petal length (which is constant) in the lognormal density.

Let’s run the model and summarize the results.

M = 100 if DRAFT else 1000
log_model = csp.CmdStanModel(stan_file =
                               '../stan/iris-petals-log.stan')
log_sample = log_model.sample(data = iris_data(), seed = 123,
                              iter_warmup = M, iter_sampling = M,      
                              show_progress = False,
                  show_console = False)
log_sample.summary(sig_figs = 2)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ 56.00 0.02800 1.2000 54.00 56.00 57.00 2000.0 14000.0 1.0
alpha 1.30 0.00022 0.0130 1.30 1.30 1.30 3700.0 26000.0 1.0
beta 0.57 0.00022 0.0140 0.55 0.57 0.60 3800.0 27000.0 1.0
sigma 0.16 0.00017 0.0096 0.15 0.16 0.18 3100.0 22000.0 1.0

The estimated values ae \(\widehat{\alpha} = 1.3\) and \(\widehat{\beta} = 0.57\) and \(\sigma = 0.16\). One advantage of the log scale is that the regression makes more sense because \(\exp(\alpha + \beta \cdot \log w) = \exp(\alpha) \cdot w^{\beta}\). By taking exponents, the intercept \(\alpha\) becomes a multiplicative factor \(\exp(\alpha)\) and the slope \(\beta\) becomes an exponent. The error also moves to the multiplicative scale, which makes it relative. Multiplicative error makes sense here as we wouldn’t expect the same error scale for petals of width 0.5cm as for petals of width 5cm. The final relationship derived through this model is as follows. \[\begin{align} \textrm{length} &= \exp(\alpha + \beta * \log(\textrm{width}) \ \pm \ 2 \cdot \sigma) \\[6pt] &= 3.67 \cdot \textrm{width}^{0.56} \cdot 1.38^{\pm 1}. \end{align}\] The plus or minus 1 on the exponent turns into either multiplication or division by 1.38.

The posterior intervals are tighter for the lognormal model, which generally indicates a better fit to data. Let’s see what the data and twenty posterior draws of the regression look like on the log scale.

df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
alpha_draws = log_sample.stan_variable('alpha')
beta_draws = log_sample.stan_variable('beta')
plot =  pn.ggplot(df, pn.aes(x='log_petal_width',
                             y='log_petal_length',
                 color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (log cm)",
                       y = "petal length (log cm)")
for a, b in zip(alpha_draws[0:20], beta_draws[0:20]):
    plot = plot + pn.geom_abline(intercept = a, slope = b,
                                 alpha = 0.5, size = 0.2)
mydraw(plot)

Now the iris versicolor and iris virginica look good, but the smaller iris setosa is still poorly characterized. We’ll fix this problem in the next section.

9.7 Multi-indexing: varying slopes and varying intercepts

We can see from the previous section that a single regression line doesn’t fit all three species of iris well. While the lognormal model is better than the linear model, it still fails to capture the characteristics of the smaller iris setosa.

On the log scale, it does look like it will be possible to capture iris setosa’s scale, as long we can build separate regressions for each species. The varying slopes and varying intercepts are sometimes called random effects, in contrast to the previous models’ fixed effects, which do not vary by data grouping.

Mathematically, we now have three intercepts three slopes, one pair for each species of iris. We will represent these as 3-vectors, \(\alpha, \beta \in \mathbb{R}^3\). Given our \(N\) data items, we will introduce an indexing data item \(\textrm{species} \in \{ 1, 2, 3 \}^N\), where \(\textrm{species}_n\) is the species of the \(n\)-th data item. Sticking to the log scale, our regression now looks like this, \[ \textrm{length}_n \sim \textrm{lognormal}(\alpha_{\textrm{species}_n} + \beta_{\textrm{species}_n} \cdot \log \textrm{width}_n, \sigma). \] The value \(\alpha_{\textrm{species}_n}\) is the intercept for \(\textrm{species}_n \in \{1, 2, 3\}\). We have kept the same error term \(\sigma\), though we could have also split that out on a per-species basis like the slope and intercept.

The Stan code follows the statistical notation.

../stan/iris-petals-varying.stan
data {
  int<lower=0> N;
  vector<lower=0>[N] petal_width;
  vector<lower=0>[N] petal_length;
  int<lower=0> K;
  array[N] int<lower=1, upper=K> species;
}
transformed data {
  vector[N] log_petal_width = log(petal_width);
}
parameters {
  vector[K] alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  petal_length ~ lognormal(alpha[species] + beta[species] .* log_petal_width,
                           sigma);
  alpha ~ normal(0, 3);
  beta ~ normal(0, 3);
  sigma ~ lognormal(0, 0.6);
}

Here, we take \(K\) to be the number of species and the species indicator thus takes on values between \(1\) and \(K\) (inclusive). Where before we had alpha and beta for the intercept and slope, we now have alpha[species] and beta[species]. These will pick our vectors of size \(N\) (number of data items), as explained below. Further, the product of the slope and width is now an elementwise product (.*), which we also consider below. The sampling statement above is equivalent to

for (n in 1:N) {
  petal_length[n]
    ~ lognormal(alpha[species[n]] + beta[species[n]] * log_petal_width[n],
                sigma);
}

Elementwise products in Stan

The elementwise product (.*) uses MATLAB syntax and is defined so that

(a .* b)[n] == a[n] * b[n]  // true

Elementwise division (./) works the same way and we don’t need elementwise addition or subtraction because those operations are already defined as regular vector addition and subtraction.

Multi-indexing in Stan

This model uses a technique we call multi-indexing and it works the similarly to the way indexing works in NumPy. Suppose we have a vector of size J and an array of indexes of size N.

array[N] int<lower=1, upper=J> idxs;
vector[J] foo;

We can use the multiple indexes in idxs to index foo as foo[idxs]. The result is a size N vector (the type is taken from foo and the size from idxs), with values given by indexing into idxs. Indexing is defined as follows.

cols(foo[idxs]) == N          // true
foo[idxs][n] == foo[idxs[n]]  // true

This tells us the number of columns of foo[idxs] is N and that indexing is done by first indexing into idxs and then using the result.

This works the same way for range indexing, such as foo[1:3], and for indexing with array literals like {7, 9, 3}, e.g.,

foo[3:5] == {foo[3], foo[4], foo[5]}         // true
foo[{7, 9, 3}] == {foo[7], foo[9], foo[3]}   // true

Fitting our varying effects model

First, we add a species column to our data frame to represent the species of the iris as an integer 1, 2, or 3. Then we will just read in the data and fit and summarize the results.

M = 100 if DRAFT else 1000
vary_model = csp.CmdStanModel(
                  stan_file = '../stan/iris-petals-varying.stan')
vary_sample = vary_model.sample(data = iris_data(), seed = 123,
                                iter_warmup = M, iter_sampling = M,     
                                show_progress = False,
                show_console = False)
vary_sample.summary(sig_figs = 2)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ 130.000 0.04700 1.900 130.000 130.000 130.00 1600.0 2000.0 1.0
alpha[1] 0.490 0.00110 0.052 0.400 0.490 0.57 2276.0 2772.0 1.0
alpha[2] 1.300 0.00058 0.029 1.200 1.300 1.30 2492.0 3035.0 1.0
alpha[3] 1.500 0.00150 0.070 1.400 1.500 1.70 2225.0 2710.0 1.0
beta[1] 0.075 0.00068 0.033 0.021 0.075 0.13 2397.0 2919.0 1.0
beta[2] 0.600 0.00190 0.093 0.440 0.600 0.75 2415.0 2942.0 1.0
beta[3] 0.230 0.00210 0.099 0.073 0.230 0.39 2197.0 2675.0 1.0
sigma 0.100 0.00000 0.000 0.100 0.100 0.10 3152.7 3840.0 1.0

There are a few things worth noting in this summary. First, recall that the parameter estimates in the fixed effects model were \(\widehat{\alpha} = 1.3\), \(\widehat{\beta} = 0.57\), and \(\widehat{\sigma} = 0.16\). In the varying effects model, we see that the estimates for \(\alpha_k\) and \(\beta_k\) vary considerably by \(k\). For example, as we see in the data, there is very little effect of width on length for iris setosa, with \(\widehat{\beta}_1 = 0.075\), wheres there is a large effect for iris versicolor of \(\widehat{\beta}_2 = 0.6\). Furthermore, our value for \(\sigma\) is lower, which indicates lower residual error in our predictors. We will leave evaluating varying error scales to the reader.

We now have three different regression lines, which we plot on the log scale with color matching the data..

df = iris_data_frame()
df['log_petal_length'] = np.log(df['petal_length'])
df['log_petal_width'] = np.log(df['petal_width'])
plot =  pn.ggplot(df, pn.aes(x='log_petal_width',
                             y='log_petal_length',
                 color='species'))
plot = plot +  pn.geom_point()
plot = plot +  pn.labs(x = "petal width (log cm)",
                       y = "petal length (log cm)")
plot = plot + pn.geom_abline(intercept = .49, slope = 0.076, color='red')
plot = plot + pn.geom_abline(intercept = 1.30, slope = 0.59, color='green')
plot = plot + pn.geom_abline(intercept = 1.50, slope = 0.23, color='blue')
mydraw(
  plot
)

We now have a good fit for each of the groups and, as we might expect, a lower error scale \(\sigma\) than for the model with a single regression.

10 Containers: arrays, vectors, and matrices

Stan is strongly typed, and each of the types has a distinct purpose. Constraints act as error checks in the data, transformed data, transformed parameter, and generated quantities blocks. In the parameter block, they are used to transform from unconstrained to constrained values (with an implicit change of variables adjustment). Stan Development Team (2023b) provides complete details or the full set of transforms, inverse transforms, and their log absolute Jacobian determinants.

10.1 Integer and real primitives

The two primitive types are int and real. In the compiled C++ code, these are 32-bit signed integers and 64-bit floating point values, respectively.

The third scalar type is complex, where a complex value consists of a real component and an imaginary component, both of which are represented as 64-bit floating point values.

Constrained primitive types

Integer and real types may be constrained with lower bounds, upper bounds, or both. For example, real<lower=0> is the type for a concentration, real<lower=0, upper=1> is the type for a probability, real<lower=-1, lower=1> is the type for a correlation, int<lower=0> is the type for a count, and int<lower=1, upper=5> is the type for an ordinal response to a survey.

There is a second kind of modifier, which is not technically a constraint, but is a transform. If we declare

vector<offset=mu, multiplier=sigma>[K] alpha;

then the unconstrained representation is multiplied by sigma and then offset by mu to get the constrained value alpha. This is primarily used for non-centered parameterizations of hierarchical models, which are beyond this tutorial.

10.2 Arrays

For any type T, we can write array[N] T for the type of an array of size N containing elements of type T. We can also write array[M, N] T for a 2D array and so on for higher dimensionality. For example, array[3, 2] int is a two-dimensional array of integers of shape 3 by 2.

10.3 Matrices

The type matrix[M, N] is for an \(M \times N\) matrix. The type complex_matrix[M, N] is for an \(M \times N\) matrix with complex values.

Constrained matrix types

There are four special matrix types. The type cov_matrix is for positive definite matrices, whereas cholesky_factor_cov is for Cholesky factors of positive definite matrices. The type corr_matrix is for positive definite matrices with unit diagonals and cholesky_factor_corr is for Cholesky factors of correlation matrices. The Cholesky factors are lower triangular and should be preferred to the full matrices where possible.

10.4 Vectors and row vectors

The type vector[M] is for column vectors with M rows, whereas row_vector[N] is for row vectors with N columns. A column vector is like an \(M \times 1\) matrix and a row vector is like a \(1 \times N\) vector. There are also complex_vector and complex_row_vector types which work the same way.

The reason we do not just represent vectors as matrices is that we are able to reduce types in arithmetic. For example, the product of a row vector and a product vector is a scalar rather than a \(1 \times 1\) matrix. Similarly, the product of a matrix and a vector is a vector.

Constrained vector types

There is a vector type ordered[M] for vectors in ascending order, and a similar type pos_ordered[M] for vectors of non-negative values in ascending order.

The vector type unit_vector[M] is for vectors of unit Euclidean length (i.e., the sum of squared values is one). There is also a simplex[M] type for simplex (non-negative values that sum to one).

Assignment and function calls

Calling a function works like assigning the variables in the argument list to the variables in the function declaration. In general, Stan is what programming language researchers call covariant, meaning that if a type T is assignable to a type U, then a type array[N] T is assignable to array[N] U. Similarly, integer-valued primitives and containers may be assigned to their real or complex counterparts, and real-valued primitives and containers may be assigned to their complex counterparts (but not the other way around). Variables of a constrained type may be assigned to variables of an unconstrained type.

11 Stan Example: Discrete parameters and mixture models

Although Stan only allows parameters which are real or complex valued, it is able to work with models involving discrete parameters by marginalizing them out in the likelihood and working in expectation during inference.

There are two reasons Stan doesn’t allow discrete parameters. First, because Stan isn’t limited to directed graphical models, it’s technically impossible to pull out efficient conditional distributions for the discrete parameters in general (though it can be done in some limited cases). Second, discrete sampling tends not to perform very well because the parameters tend to get locked and not mix well.

11.1 Normal mixture model

We’ll consider a simple two-component normal mixture model parameterized with a discrete responsibility parameter. This model assumes that each item \(y_n\) is drawn from one of two possible normal distributions, \(\textrm{normal}(\mu_1, \sigma_1)\) or \(\textrm{normal}(\mu_2, \sigma_2)\), with a probability \(\lambda \in [0, 1]\) of being drawn from the first distribution. Let’s assume we have \(N\) observations \(y_n \in \mathbb{R}\). We will use \(z_n \in \{ 0, 1 \}\) as the discrete parameter representing the distribution from which \(y_n\) arose. This gives us the following model \[\begin{align} z_n &\sim \textrm{bernoulli}(\lambda) \\[6pt] y_n &\sim \textrm{normal}(\mu_{z_n}, \sigma_{z_n}) \end{align}\] In frequentist settings, the parameters \(z_n\) are sometimes called missing data to finesse a philosophical aversion to probability distributions over parameters.

11.2 Marginalization to the rescue

The immediate problem is that we cannot code this model in Stan by just following the math because we do not have discrete parameters. But what we can do is marginalize out the discrete parameters. We know from the law of total probability that if \(B\) is a discrete random variable, then we can derive the marginal probability \(p(a)\) from the joint probability function \(p(a, b)\) by \[ p(a) = \sum_{b \in B} p(a, b). \]

We can express our mixture model’s sampling distribution over both \(y\) and \(z\), \[ p(y, z \mid \lambda, \mu, \sigma) = \prod_{n = 1}^N p(y_n, z_n \mid \lambda, \mu, \sigma). \] This is called the complete data likelihood in frequentist settings.

Then we can evaluate the likelihood elementwise, \[ p(y_n, z_n \mid \lambda, \mu, \sigma) = \textrm{bernoulli}(z_n \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_{z_n}, \sigma_{z_n}). \] Now we can marginalize out the \(z_n\) by considering values 0 and 1, \[\begin{align} p(y_n \mid \lambda, \mu, \sigma) &= p(y_n, z_n = 1 \mid \lambda, \mu, \sigma) + p(y_n, z_n = 0 \mid \lambda, \mu, \sigma) \\[6pt] &= \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) \\[2pt] & \qquad + \ \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_2, \sigma_2). \end{align}\] We can further substitute in the value for the Bernoulli densities, which are \[\begin{align} \textrm{bernoulli}(1 \mid \lambda) &= \lambda, \ \textrm{and} \\[4pt] \textrm{bernoulli}(0 \mid \lambda) &= 1 - \lambda, \end{align}\] to produce the standard form of the mixture model sampling distribution, \[ p(y_n \mid \lambda, \mu, \sigma) = \lambda \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1) + (1 - \lambda) \cdot \textrm{normal}(y_n \mid \mu_2, \sigma_2). \]

11.3 Simulated data: adult height of men and women

Let’s consider a specific instance, a collection of height data where measurements for men and women have not been identified. Let’s assume for the sake of simulation that men have a mean height of 175cm with standard deviation of 7.5cm and women have a mean height of 162cm with a standard deviation of 6.3cm. We’ll further assume that women make up 51% of the adult population (this varies across age bands). This will allow us to simulate and plot data for 5000 randomly selected heights; we put vertical lines at the mean female height (red) and male height (blue).

M = 100 if DRAFT else 5_000
female_mean_height_cm = 162
male_mean_height_cm = 175
female_sd_height_cm = 6.3
male_sd_height_cm = 7.5
prob_female = 0.51
mu = np.array([female_mean_height_cm, male_mean_height_cm])
sigma = np.array([female_sd_height_cm, male_sd_height_cm])
sex = np.array(np.random.binomial(n = 1, p = prob_female, size=M))
heights = np.random.normal(loc = mu[sex], scale = sigma[sex], size=M)

mydraw(
  pn.ggplot(pd.DataFrame({'height': heights}),
                 pn.aes(x = 'height')) +
  pn.geom_histogram(bins=50, color='white') +
  pn.labs(x = 'height') +
  pn.geom_vline(xintercept=162, color='red') +
  pn.geom_vline(xintercept=175, color='blue') +
  pn.theme(axis_text_y = pn.element_blank(),
           axis_title_y = pn.element_blank(),  
           axis_ticks_major_y = pn.element_blank())

)

11.4 The log scale and log sum of exponents operation

We have successfully marginalized out the \(z\) and derived \(p(y \mid \lambda, \mu, \sigma)\), but this leaves us with a residual problem—Stan works on the log scale and we have done our derivation on the linear scale. That is, we start with \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \log\! \big( \, \strut & \textrm{bernoulli}(0 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_0, \sigma_0) \\[2pt] & + \textrm{bernoulli}(1 \mid \lambda) \cdot \textrm{normal}(y_n \mid \mu_1, \sigma_1)\, \big). \end{align}\] To reduce this further, we are going to use the log-sum-exp operation, which is defined by \[ \textrm{log-sum-exp}(a, b) = \log \left( \strut \exp(a) + \exp(b)\right). \]

The reason we use log-sum-exp is that it can be made arithmetically stable in such a way as to prevent numerical overflow by pulling out the maximum value, \[ \textrm{log-sum-exp}(a, b) = \textrm{max}(a, b) + \textrm{log-sum-exp}(a - \textrm{max}(a, b), \ b - \textrm{max}(a, b)) \\[4pt] \] This form eliminates numerical overflow because exponentiation is only applied to the terms \(\left( a - \textrm{max}(a, b)\right)\) and \(\left(b - \textrm{max}(a, b)\right)\), both of which must be less than or equal to zero by construction. Underflow is mitigated by pulling the leading term \(\textrm{max}(a, b)\) out of the sum.

Using log-sum-exp, we can now rewrite our marginalized individual item likelihood as \[\begin{align} \log p(y_n \mid \lambda, \mu, \sigma) = \textrm{log-sum-exp}\big( &\log \textrm{bernoulli}(0 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_0, \sigma_0), \\[2pt] &\log \textrm{bernoulli}(1 \mid \lambda) + \log \textrm{normal}(y_n \mid \mu_1, \sigma_1) \big). \end{align}\]

11.5 Stan program for mixtures

As in our earlier examples, the Stan code for mixtures just follows the mathematical definitions.

heights.stan
data {
  int<lower=0> M;
  vector<lower=0>[M] heights;
}
parameters {
  real<lower=0, upper=1> lambda;
  ordered[2] mu;
  vector<lower=0>[2] sigma;
}
model {
  mu ~ normal(170, 10);
  sigma ~ lognormal(log(7), 1);
  for (m in 1:M) {
    real lp1 = bernoulli_lpmf(0 | lambda)
        + normal_lpdf(heights[m] | mu[1], sigma[1]);
    real lp2 = bernoulli_lpmf(1 | lambda)
        + normal_lpdf(heights[m] | mu[2], sigma[2]);
    target += log_sum_exp(lp1, lp2);
  }
}

This program uses local variables lp1 and lp2. These have scope just within the for-loop and take on different values with each loop iteration. In general, local variables can be assigned and reassigned in Stan. They are declared with sizes, but cannot have constraints.

Unlike in the original model, we have to index mu and sigma from 1 because Stan arrays are indexed from 1. The biggest departure from the mathematical model as written is that we have declared mu to be of type ordered, which is the type of a vector of increasing values. The reason we do this is that the indexes are not well identified in the sense that swapping indexes preserves the same likelihood. This way, the smaller component will be the first component and the model will be identified without changing the likelihood. Betancourt (2017b) provides more details on how this works and shows that ordering does not change the underlying model.

We have included weakly informative priors for mu and sigma based on our prior knowledge of heights. The likelihood then just follows the mathematical definition. We had to resort to a loop because Stan isn’t currently able to vectorize mixtures.

11.6 Fitting the simulated data

Now we can fit our simulated data.

model = csp.CmdStanModel(stan_file = '../stan/heights.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = J, iter_sampling = J,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)          
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -18000.00 0.0490 1.60 -18000.0 -18000.00 -18000.00 1100.0 12.0 1.0
lambda 0.52 0.0028 0.07 0.4 0.52 0.63 640.0 7.3 1.0
mu[1] 162.00 0.0270 0.69 161.0 162.00 163.00 651.0 7.5 1.0
mu[2] 175.00 0.0470 1.20 173.0 175.00 177.00 627.0 7.2 1.0
sigma[1] 6.10 0.0100 0.28 5.7 6.10 6.60 724.0 8.3 1.0
sigma[2] 7.70 0.0180 0.46 6.9 7.70 8.40 643.0 7.4 1.0

Mixture models are quite challenging to fit when the components have similar distributions as they do here (the normal components have similar locations and scales so that much of their probability mass overlaps). Checking the fit, we find \(\widehat{R}\) values as high as 1.01 (still acceptable) and effective sample sizes of around 1/6 the number of iterations. Nevertheless, we see the Monte Carlo standard error is relatively low for all posterior means other than for \(\lambda\). Although we recover reasonable estimates for all parameters (simulated values were \(\lambda = 0.51\), \(\mu = [162 \ 175]\) and \(\sigma = [6.3 \ 7.5]\)), the posterior intervals are only well constrained for mu, with sigma, with lambda being less well identified. These intervals shrink with the number observations \(N\) at a rate of \(\mathcal{O}(1 / \sqrt{N})\).

11.7 Built-in Stan function for mixtures

For simple mixtures of this kind, Stan supplies a built-in function

log_mix(lambda, lp1, lp2)
    = log_sum_exp(log(lambda) + lp1,
                  log1m(lambda) + lp2)

which is a more efficient drop-in replacement for the explicit definition in our example code. The body of the loop in the model can be replaced with the one-liner

target += log_mix(lambda, normal_lpdf(heights[m] | mu[1], sigma[1]),
                          normal_lpdf(heights[m] | mu[2], sigma[2]));

This may run a bit faster, but it won’t mix any better as it defines exactly the same target log density as the long form.

11.8 Recovering posterior distributions over discrete parameters

We have marginalized out the discrete parameters, but what if they are of interest in the fitted model? It turns out, we can sample \(z\) in the generated quantities block. All we have to do is work out the probability that \(z_n = 0\) (female) and sample. Once we have this probability, we can calculate many quantities of interest in expectation rather than working with the sample (i.e., we will Rao-Blackwellize). The calculation for the expected sex given values of \(\mu\) and \(\sigma\) and \(y_n\) is \[ \Pr\!\left[Z_n = 1 \mid y, \mu, \sigma\right] \ = \ \mathbb{E}\left[Z_n \mid y, \mu, \sigma\right] \ = \ \frac{\displaystyle p(y_n \mid \mu_0, \sigma_0)} {\displaystyle p(y_n \mid \mu_0, \sigma_0) + p(y_n \mid \mu_1, \sigma_{z_n})} \] On the log scale where Stan operates, that’s \[ \log \textrm{Pr}[Z_n = 1 \mid y, \mu, \sigma] = \log p(y_n \mid \mu_0, \sigma_0) - \textrm{log-sum-exp}(\log p(y_n \mid \mu_0, \sigma_0), \log p(y_n \mid \mu_1, \sigma_1). \]

Here’s a new generated quantities block we can add to the last program, heights.stan, in order to generate the probability that the first 10 data items are heights for women, and to randomly generate the sex for the first ten individuals based on this probability.

../stan/heights-post.stan
generated quantities {
  vector<lower=0, upper=1>[10] Pr_female;
  for (m in 1:10) {
    real lp1 = bernoulli_lpmf(0 | lambda)
               + normal_lpdf(heights[m] | mu[1], sigma[1]);
    real lp2 = bernoulli_lpmf(1 | lambda)
               + normal_lpdf(heights[m] | mu[2], sigma[2]);
    Pr_female[m] = exp(lp1 - log_sum_exp(lp1, lp2));
  }
  array[10] int<lower=0, upper=1> sex = bernoulli_rng(Pr_female);
}

Now we can compile and fit this model.

model = csp.CmdStanModel(stan_file = '../stan/heights-post.stan')
data = {'M': M, 'heights': heights}
J = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = J, iter_sampling = J,
                      show_progress = False, show_console = False)
sample.summary(sig_figs = 2)          
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -18000.0000 0.05000 1.6000 -18000.00000 -18000.0000 -18000.00 1000.0 11.0 1.0
lambda 0.5200 0.00310 0.0690 0.41000 0.5200 0.63 490.0 5.4 1.0
mu[1] 162.0000 0.03000 0.6700 161.00000 162.0000 163.00 515.0 5.7 1.0
mu[2] 175.0000 0.05200 1.1000 173.00000 175.0000 177.00 482.0 5.3 1.0
sigma[1] 6.1000 0.01200 0.2800 5.70000 6.1000 6.60 572.0 6.3 1.0
sigma[2] 7.7000 0.01900 0.4500 7.00000 7.7000 8.50 543.0 6.0 1.0
Pr_female[1] 0.9500 0.00210 0.0470 0.86000 0.9600 0.99 490.0 5.4 1.0
Pr_female[2] 0.0680 0.00180 0.0420 0.01900 0.0590 0.15 520.0 5.7 1.0
Pr_female[3] 0.0600 0.00170 0.0380 0.01600 0.0510 0.13 523.0 5.8 1.0
Pr_female[4] 0.4200 0.00540 0.1200 0.24000 0.4100 0.63 487.0 5.4 1.0
Pr_female[5] 0.0500 0.00150 0.0330 0.01200 0.0420 0.11 527.0 5.8 1.0
Pr_female[6] 0.1600 0.00340 0.0770 0.06200 0.1500 0.31 502.0 5.5 1.0
Pr_female[7] 0.4700 0.00550 0.1200 0.28000 0.4600 0.68 486.0 5.4 1.0
Pr_female[8] 0.9000 0.00270 0.0610 0.79000 0.9100 0.98 495.0 5.5 1.0
Pr_female[9] 0.0065 0.00029 0.0069 0.00074 0.0044 0.02 573.0 6.3 1.0
Pr_female[10] 0.0590 0.00160 0.0380 0.01600 0.0500 0.13 523.0 5.8 1.0
sex[1] 0.9500 0.00370 0.2100 1.00000 1.0000 1.00 3285.0 36.0 1.0
sex[2] 0.0640 0.00400 0.2500 0.00000 0.0000 1.00 3688.0 41.0 1.0
sex[3] 0.0650 0.00420 0.2500 0.00000 0.0000 1.00 3442.0 38.0 1.0
sex[4] 0.4200 0.00950 0.4900 0.00000 0.0000 1.00 2704.0 30.0 1.0
sex[5] 0.0530 0.00390 0.2200 0.00000 0.0000 1.00 3308.0 37.0 1.0
sex[6] 0.1700 0.00630 0.3800 0.00000 0.0000 1.00 3582.0 40.0 1.0
sex[7] 0.4700 0.00900 0.5000 0.00000 0.0000 1.00 3096.0 34.0 1.0
sex[8] 0.9000 0.00510 0.3000 0.00000 1.0000 1.00 3474.0 38.0 1.0
sex[9] 0.0065 0.00140 0.0800 0.00000 0.0000 0.00 3544.0 39.0 1.0
sex[10] 0.0570 0.00380 0.2300 0.00000 0.0000 1.00 3703.0 41.0 1.0

The probabilities range from 0 to 1 with some intermediate values. We have calculated posterior expectations two ways, by averaging expected values (Pr_female) and by taking random draws (sex). The posterior means are similar for these variables, but sex has much higher variance due to the sampling.

We can compare the probabilities to the simulated values of \(z\), which are

print(f"{sex[0:10] =}")
sex[0:10] =array([0, 1, 1, 0, 0, 1, 0, 0, 1, 1])

12 Debugging Stan programs

As is usual in simple tutorials like this, we have presented a series of Stan programs that actually work as intended. When developing new programs, one often runs into problems. This section will cover some typical Stan coding errors and provide some tips on debugging.

12.1 Failure to provide support over constrained parameters

Stan requires programs to assign a finite log density to any parameters that satisfy the constraints. For example, the following program puts no constraints on sigma in the declaration, but the model block requires sigma to be greater than zero.

parameters {
  real sigma;
}
model {
  sigma ~ lognormal(0, 1);
}  

This violates Stan’s requirements because the declared constraints allow a negative value for sigma, but the model block does not. The way to fix this is to declare sigma to be of type real<lower=0>. Often this kind of bug shows up as a failure to randomly initialize. For instance, if we had vector[10] sigma, we’d only have a \(2^{-10}\) or about 1/1000 chance of generating valid initializations at random, because Stan initializes parameters using a \(\textrm{uniform}(-2, 2)\) distribution on the unconstrained scale.

12.2 Start with small, simulated data sets

It can be difficult to debug wild type data at scale. Instead, simulating a small data set lets you work the kinks out of the code in a controlled context where we know the true answers. Then, when the code is working over small simulated data, it can be loosed on the real data.

Simulating data can be challenging for some models. First, it can be a lot of effort, because it requires rewriting the model using random number generation in the generated quantities block. Second, some models are not fully generative because they either have improper priors or because they have reduced rank parameterizations (e.g., intrinsic conditional autoregressive models or sum-to-zero parameter constraints).

12.3 Debug by print

Stan has a built-in print() function that can take string and Stan variable arguments, which will be printed when it is executed. In particular, printing the target() value can help detect when it becomes \(-\infty\) (and thus causes rejections). This can be done line by line, for example,

model {
  y ~ normal(alpha + beta * x, sigma);
  print("step 17: target() = ", target());
  sigma ~ lognormal(0, 1);
  print("step 18: target() = ", target());
}

Then if one of the statements throws an exception, you will see the print statement just before it was called. If one of the statements leads to a not-a-number or negative infinity value, you will see the first place this affects target(). You can also print the results of intermediate calculations. Be careful printing long structures; you might want to try printing a slice, e.g.,

matrix[256, 256] a = b * c;
print("a[1:3, 5:6] = ", a[1, 1]);

12.4 Test quantities of interest

So far, we’ve only talked about making the program work as intended. We also want to test our output as we have done in several of our running examples. If we wrote a model in order to provide posterior predictive inference, test it with posterior predictive inference (either real or simulated).

12.5 More advice on workflow

Going into full details of how to develop Stan models is beyond this simple getting started tutorial. For more information, see Gelman et al. (2020). The various inference mechanisms targeted by the Bayesian workflow suggestions can be implemented in Stan as outlined in Part III of Stan Development Team (2023c).

13 Making Stan programs go faster

For MCMC, there are two independent components to efficiency, (1) the statistical efficiency of the sampler, which is measured by effective sample size per iteration, and (2) program efficiency, which is measured by how long the program takes to compute the log density and gradients.

13.1 Computational efficiency

There are many ways to write a Stan program to compute the log density and gradients of a model with respect to a fixed parameter vector. Some of them are faster than others. Which version is faster may wind up depending on the shape of the data (e.g., caching can be very valuable in some circumstances where variables are heavily reused, but can be a burden in sparse data circumstances where all combinations of variables are not used).

For speeding up Stan programs, most of the usual considerations in speeding up computer programs apply. Specifically, we want to optimize memory locality and we want to reduce branch points; together these are the main efficiency bottlenecks for modern code.

Working first, then optimize

As Donald Knuth is reputed to have said, “Premature optimization is the root of all evil.” It’s pretty much always a good idea to get a simple version of your code working on a small slice of data first. Then only worry about optimization if you need it. If you do need to optimize, don’t try it without an unoptimized form that works as a guide. Every developer has learned through hard experience that it’s often a whole lot easier to optimize working code than it is to debug optimized code.

Do not repeat yourself

The main rule for making Stan programs go fast is don’t repeat yourself. If there is a computation, don’t do that computation again, save the result in a local variable and reuse. For example, don’t do this:

for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], exp(log_sigma));
}

The problem here is that exp(log_sigma) gets computed N times. This uses more memory, as Stan records every operation with links to its operands during. It is also slow, with N - 1 redundant applications of exponentiation in the forward pass and and multiply/adds for the chain rule in the backward pass. Instead, the code should be refactored to compute the exponentiation of log_sigma once, assign it to a local variable, and re-use it in the loop.

real sigma = exp(log_sigma);
for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}

Avoid automatic differentiation if possible

If you have a constant that’s being defined, define it in the transformed data block. That way it only gets evaluated once. For example, don’t do the following, as it repeatedly allocates and fills a \(K\)-vector and then applies useless automatic differentiation steps in the reverse pass.

data {
  real<lower=0> alpha;
}
model {
  vector[K] alpha_v = rep_vector(alpha, K);
  theta ~ dirichlet(alpha_v);
}  

Instead, define the constant in the transformed data block and then reuse it in the model block, as follows.

transformed data {
  vector<lower=0>[K] alphas = rep_vector(alpha, K);
}
model {
  theta ~ dirichlet(alphas);
}

The other way to get rid of autodiff is to move posterior predictions into the generated quantities block. In general, if it’s possible to move a variable to the generated quantities block, it should be moved. For example, we might be tempted do posterior predictive simulation as follows.

parameters {
  real alpha, beta;
  real<lower=0> sigma;
  vector[M_tilde] y_tilde;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  y_tilde ~ normal(alpha + beta * x_tilde, sigma);
}

This is correct and will provide the right answer, but it is computationally and statistically inefficient. It’s computationally inefficient because Hamiltonian Monte Carlo is more expensive than random variate generation. It’s also statistically inefficient because it will typically produce correlated draws rather than independent draws, leading to a lower effective sample size. In situations like this, we can move y_tilde down to generated quantities because it’s conditionally independent of the data given mu and sigma.

parameters {
  real alpha, beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  array[M_tilde] y_tilde = normal_rng(alpha + beta * x_tilde, sigma);
}  

Vectorized _rng functions return arrays; if we really need a vector for y_tilde for some reason, we can convert with to_vector().

Do not use diagonal multivariate normal

Instead of

y ~ multi_normal(mu, diag_matrix(sigma));

which creates a matrix then factors to compute its inverse and determinant, we can use the vectorized form of the univariate normal,

y ~ normal(mu, sigma);

The result is the same up to a constant.

Memory locality, copying, and storage order

Accessing a row or column of a matrix requires allocating memory and copying. Memory pressure is one of the biggest constraints in modern code and relieving it is one of the best things to do to make code go faster.

Matrices are organized in column-major order in memory, whereas arrays are row major. That means that the columns of a matrix are stored together in memory, and therefore slicing out a column Sigma[1:M, n] is relatively efficient for large M, but slicing out a row using Sigma[m, 1:N] will be inefficient if M is large.

If you need to access vectors one by one but you do not need matrix operations, use an array of vectors in either order:

matrix[M, N] alpha;
array[M] row_vector[N] alpha_arr;
array[N] vector[M] alpha_t_arr;  // transposed

Accessing elements of an array does not require any copying.

Overall, copying and rearranging data structures is not very expensive in Stan. Loops are very fast, and rearranging doesn’t add anything to the expression graph for automatic differentiation.

Vectorize non-linear operations over parameters

Even though loops are very fast in Stan, every expression involving parameters leads to nodes and edges being added to the expression graph. The usual suspect is sampling in a loop, as follows.

for (n in 1:N) {
  y[n] ~ normal(alpha + beta * x[n], sigma);
}

Written this way, Stan isn’t smart enough to realize it can compute log(sigma) once and re-use. It will also build a huge number of expression graph edges from the log density of to alpha and beta, one for each y[n]. The same contribution to the log density can be executed much faster with vectorized code such as the following.

y ~ normal(alpha + beta * x, sigma);

The speed from vectorization is not because loops are slow, it’s because doing autodiff in a loop leads to a much larger expression graph and hence much more work when updating the chain rule. The vectorized form has a single output variable for the log density with three edges leading to alpha, beta, and sigma. It will also be clever enough to calculate log(sigma) once and reuse it. This advantage is even bigger for distributions like multivariate normal that require expensive internal determinant calculations.

13.2 Statistical efficiency

By far the best solution for making Stan programs go faster is to improve the statistical efficiency of the model.

Identifiability

The first thing to consider is identifiability. A likelihood \(p(y \mid \theta)\) is identifiable if distinct parameters lead to distinct likelihoods, or in symbols, if \(\theta \neq \theta'\) implies \(p(y \mid \theta) \neq p(y \mid \theta')\). The other way around, a likelihood is not identifiable if there exist \(\theta \neq \theta'\) such that \(p(y \mid \theta) = p(y \mid \theta').\)

The classic Bradley-Terry model for inferring quality based on pairwise comparisons has a non-identifiable likelihood. Let’s consider an application to sports team ranking (it was originally applied to consumer products). Each team \(j\) has an ability ranking \(\alpha_j\). The log odds of team \(j\) defeating team \(k\) is modeled as \(\alpha_j - \alpha_k\). The observations are of the results of games \(n\) between team \(A_n\) and \(B_n\) (the notation is meant to be suggestive of the original application to A/B testing). This defines the likelihood \[ p(y_n \mid \alpha) = \textrm{bernoulli}(\textrm{logit}^{-1}(\alpha_{A_n} - \alpha_{B_n}). \] In Stan, this likelihood can be coded as

model {
  y ~ bernoulli_logit(alpha[A] - alpha[B]);
}

where the vectorized sampling statement unfolds to the following equivalent, but less efficient, form.

  for (n in 1:N) {
    y[n] ~ bernoulli_logit(alpha[A[n]] - alpha[B[n]]);
  }    

The model is clearly non-identifiable because \[ p(y_n \mid \alpha) = p(y_n \mid \alpha + c). \]

To get around this problem, we can do three things. The first two involve reducing the number of free parameters and the third involves soft identification through a prior. First, we can pin one of the values \(\alpha\), traditionally by setting \(\alpha_1 = 0\). This reduces the number of free parameters by one, as is reflected in the Stan parameterization.

parameters {
  vector[J - 1] alpha_rest;
}
transformed parameters {
  vector[J] alpha = append_row(0, alpha_rest);
}

We can then work with alpha directly. In the second approach, we can enforce the constraint \(\textrm{sum}(\alpha) = 0\). This also reduces the number of free parameters by one and is coded in a similar way, only with a different transform.

transformed parameters {
  vector[J] alpha = append_row(-sum(alpha_rest), alpha_rest);
}  

On the plus side, this approach is symmetric. On the downside, it is not generative in that it is not clear how to add a new player \(J + 1\).

The third way to make a Bayesian model identifiable is to add a prior. For example, we can do this,

model {
  alpha ~ normal(0, 3);
}

This provides a kind of soft identification in the posterior, where values of alpha near 0 have higher posterior density. The third approach is overparameterized but has a pleasing symmetry and generalizability to new players.

Adaptation and preconditioning

The condition number of a positive-definite matrix is the ratio of its largest eigenvalue to its smallest. For a density, we can consider the negative inverse Hessian of its log density at a point. For a multivariate normal distribution, the negative inverse Hessian is its covariance at every point. That is, a normal distribution has constant curvature.

The larger the condition number, the harder it is to sample from a model, because multiple steps at the scale of the smallest eigenvalue are required to take a step in the direction of the largest eigenvalue. The first-order gradient-based approximation that Hamiltonian Monte Carlo uses can also fail due to high curvature; when the step is too large, the result will be divergent Hamiltonian trajectories (i.e., ones that do not preserve the Hamiltonian).

When the inverse Hessian is not positive definite, the density is not convex, which can present an even greater challenge to sampling than poor conditioning. If the inverse Hessian is positive definite and relatively constant over the posterior, we can correct for non-unit conditioning by preconditioning, which can rotate and scale a density until it looks more like a standard normal. In fact, given a multivariate normal it does just that—rotates and scales back to a standard normal.

Stan begins adaptation in a state that assumes the posterior is standard normal, which is (a) centered at the origin, (b) independent in dimension, and (c) unit scale. The closer a posterior is to standard normal, the easier it will be for Stan to adapt and to sample.

During adaptation, Stan estimates the posterior (co)variance to use for preconditioning. If the posterior Hessian is constant, the covariance will be equal to the negative inverse Hessian everywhere.

With its default settings, Stan preconditions by scaling each of the parameters as close to unit scale as it can manage. Under this scheme, in \(D\) dimensions, the adapted metric requires \(\mathcal{O}(D)\) storage and each leapfrog step (the most inner loop in Stan) requires \(\mathcal{O}(D)\) time. By selecting a dense metric, Stan will also adapt to a fixed curvature (i.e., one that does not vary over the posterior). This requires \(\mathcal{O}(D^2)\) storage, several \(\mathcal{O}(D^3)\) costs to Cholesky factor, and then requires \(\mathcal{O}(D^2)\) time per leapfrog step to evaluate.

Reparameterization

We can often reparameterize models to make them easier to sample. The standard example is reparameterizing a varying effect to use a non-centered parameterization. The centered parameterization is

parameters {
  real<lower=0> sigma;
  vector[N] alpha;
}
model {
  sigma ~ lognormal(0, 1.5);
  alpha ~ normal(0, sigma);
}

This model induces a strong dependence between sigma and alpha. When sigma is small, alpha must be small, and when sigma is large, alpha will vary over a wide range. This varying curvature (high curvature for low sigma, very flat expanses for high sigma) is very challenging for MCMC sampling.

We can overcome the difficulty in this parameterization by non-centering, which we can code explicitly as follows.

parameters {
  real<lower=0> sigma;
  vector[N] alpha_std;
}
transformed parameters {
  alpha = sigma * alpha_std;
}
model {
  sigma ~ lognormal(0, 1.5);
  alpha_std ~ normal(0, 1);
}

We now have an independent standard normal distribution on alpha_std, not on alpha. It induces the correct distribution on the transformed parameter alpha, so that alpha has a normal(0, sigma) distribution. While this correction is only in the prior, if there is not a lot of data for each n, the posterior will be similar.

We can code the same model as above using an affine variable transform to rescale alpha with sigma.

parameters {
  real<lower=0> sigma;
  vector<multiplier=sigma>[N] alpha;  // non-centering
}
model {
  sigma ~ lognormal(0, 1.5)
  alpha ~ normal(0, sigma);  
}

Priors for speed and knowledge

Often we are tempted to use vague or overdispersed priors. When this gets too extreme, it can cause computational difficulties navigating flat expanses of probability space. And we will have statistical difficulties from the prior pulling most of the probability mass away from the natural posterior.

If you have used vague priors to avoid thinking too hard and computation is going awry, you might consider adding priors with a bit more information. It’s often possible to impose weakly informative priors that only determine the scale of a variable, which is often known. For example if we think a value \(\alpha\) is going to have a value in the single digits, we can use a prior like

alpha ~ normal(0, 5);

that puts approximately 95% of the probability mass in the region \((-10, 10)\).

We do not recommend interval constraints other than for naturally constrained parameters (e.g., probabilities fall in \([0, 1]\) and concentrations in \([0, \infty)\)). The reason we do not want to use

alpha ~ uniform(-10, 10);

is that if the data is consistent with alpha near the boundary (-10 or 10), then probability mass gets bunched up and posterior means get pushed away from the boundary. For example, consider the following Stan program.

stan/interval-censor.stan
data {
  int<lower=0> B;
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=-B, upper=B> mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}  

In this model, we have an improper uniform prior on sigma and a proper uniform prior on mu in the interval \((-B, B)\). Stan allows improper priors as long as the posterior is proper.

We can then simulate data according to the sampling distribution, \[ y_n \sim \textrm{normal}(9.9, 4). \] Although the true value of \(\mu\) is 9.9, which falls in the interval \((-10, 10)\), consider what happens when we fit 50 random draws with this model.

model = csp.CmdStanModel(stan_file = '../stan/interval-censor.stan')
N = 50
mu = 9.9
sigma = 4
y = np.random.normal(mu, sigma, N)
B = 10
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample.summary(sig_figs = 2)              
interval = (-10, 10)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -92.0 0.0320 1.10 -94.0 -91.0 -90.0 1200.0 18000.0 1.0
mu 9.7 0.0057 0.28 9.1 9.7 10.0 2300.0 35000.0 1.0
sigma 3.8 0.0100 0.41 3.2 3.8 4.5 1500.0 23000.0 1.0

What happens is that all of the posterior mass for \(\mu\) that would fall above 10 is pushed down inside the interval. As a result, we bias \(\mu\) and \(\sigma\) so that their estimates are too low. If we were to expand the interval to 20 and fit the same data, the result is quite different.

B = 20
data = {'B': B, 'N': N, 'y': y}
M = 100 if DRAFT else 1000
sample2 = model.sample(data = data, seed = 123,
                      iter_warmup = M, iter_sampling = M,
                      show_progress = False, show_console = False)
print(f"interval = ({-B}, {B})")
sample2.summary(sig_figs = 2)             
interval = (-20, 20)
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
lp__ -88.0 0.0250 1.10 -90.0 -87.0 -87.0 1700.0 29000.0 1.0
mu 10.0 0.0110 0.54 9.4 10.0 11.0 2500.0 41000.0 1.0
sigma 3.8 0.0078 0.42 3.2 3.8 4.6 2900.0 48000.0 1.0

If you were thinking we should recover \(\mu = 9.9\) and \(\sigma = 4\), it’s worth noting the sample mean and standard deviation for our simulated data y, which are

print(f"mean(y) = {np.mean(y):5.2f};  sd(y) = {np.std(y):4.2f}")
mean(y) = 10.28;  sd(y) = 3.67

13.3 A final note on priors

Priors are particularly important when there is not much data in order to control computation. We can evaluate prior sensitivity by considering changing priors, as we did in the last section, to see what their effect is on the posterior.

In general, we would recommend using as much prior knowledge as you can during inference. Not doing so leaves useful information on the table.

14 Further reading

14.1 Scientific programming in Python

A good introduction to Python for data science applications is VanderPlas (2023). The current edition is available open access from the GitHub repository for the Python Data Science Handbook.

For intermediate Python programming, a standard reference is Ramalho (2022). The current edition is available from the GitHub repository for Fluent Python.

14.2 Random number generation

The standard reference for pseudorandom number generation is still the book by Devroye (1986). It starts from scratch with an overview of uniform pseudorandom number generation and proceeds through advanced multivariate techniques. The author has provided a free online pdf on the home page of Non-Uniform Random Variate Generation.

14.3 Bayesian statistics

For an introduction to Bayesian statistics, I would highly recommend starting with the book by McElreath (2023). The second edition is available as a free online pdf from the GitHub repository for Statistical Rethinking.

For more advanced Bayesian statistics and more on computation, the third edition of the classic by Gelman et al. (2013). The third edition is available as a free pdf from the home page for Bayesian Data Analysis 3.

14.4 Markov chain Monte Carlo

A good single-source for learning about MCMC methods is the handbook edited by Brooks et al. (2011). The key chapters on basic MCMC theory, convergence monitoring, and Hamiltonian Monte Carlo are available for free from the sample chapters page of the Handbook of Markov Chain Monte Carlo.

We can also recommend the paper by Roberts and Rosenthal (2004) as a solid mathematical foundation for MCMC that is about the same mathematical level as the appendices of this introduction.

14.5 Stan

The reference Stan paper is by our original development team (Carpenter et al. 2015), but there is no point in reading that paper if you have read this. It’s better to skip directly to the Stan User’s Guide (Stan Development Team 2023c), which contains much more information on modeling in Stan than this introduction. For full details on Stan’s programming language and execution, see the Stan Reference Manual (Stan Development Team 2023b); for information on the library of special mathematical and statistical functions, see the Stan Functions Reference (Stan Development Team 2023a).

There is also a wealth of information available in the form of Stan case studies and Stan conference proceedings, both of which are distributed in the form of reproducible notebooks.

14.6 Bayesian workflow

The paper that kicked off the Stan project’s focus on workflow is Gabry et al. (2019); the current state of our thinking is Gelman et al. (2020). Many of the Stan developers are working on turning the paper into a book; the current state is available from the public GitHub repository for Bayesian Workflow.

AA. Set theory

Probability theory is based on simple notions from set theory such as unions and intersections. We will need to generalize to infinite sets and worry about issues such as countability, all of which we define in this appendix.

A.1 Basic set notation

One way to think of a set is as a bag of objects without duplicates. We write sets using curly braces. For example, the set containing the numbers 1 and 2 is written as \(\{ 1, 2 \}\). Sets are not ordered, so the notation \(\{ 1, 2 \}\) picks out the same set as \(\{ 2, 1 \}\).

Ellipses notation

Sets are often written with informal shorthand with ellipses (\(\ldots\)) standing in for a sequence of elements. For example, \(\{ 1, 2, \ldots, 7\}\) would pick out the set \(\{ 1, 2, 3, 4, 5, 6, 7 \}\) and \(\{ x_0, x_1, \ldots, x_4 \}\) is shorthand for \(\{ x_0, x_1, x_2, x_3, x_4 \}\).

Ellipses that end rather than separate are used to denote an unbounded sequence of elements. For example, we will write \(\{ 0, 1, 2, \ldots \}\) for the set of all counting numbers.

A.2 Membership, subsets, and equality

If \(A\) is a set, we write \(x \in A\) if \(x\) is a member of the set \(A\), and we write \(x \not\in A\) otherwise. For example, \(1 \in \{ 1, 2 \}\), but \(7 \not\in \{ 1, 2 \}\).

If \(A\) and \(B\) are sets, we write \(A \subseteq B\) and say that \(A\) is a subset of \(B\) if \(x \in A\) implies \(x \in B\). For example, \(\{ 1, 3 \} \subseteq \{ 0, 1, 2, 3, 4 \}\), but \(\{ 1, 3 \} \not\subseteq \{ 0, 2, 4 \}\).

Two sets \(A\) and \(B\) are equal, written \(A = B\), if they are subsets of each other, \(A \subseteq B\) and \(B \subseteq A\).

A.3 Distinguished sets

Empty set

We use the notation \(\emptyset\) for the unique empty set which contains no members. In other words, for any \(x\), we know \(x \not\in \emptyset\). That means the empty set is a subset of every other set \(A\), \(\emptyset \subseteq A\).

Universal set

A universal set \(\mathcal{U}\) is one such that for every set \(A\) under consideration in some context, we have \[ A \subseteq \mathcal{U}. \] Depending on the sets under consideration, universal sets may not exist (in the sense of not being a set). We will only be considering sets of numbers here rather than infinitely nested sets of sets, so most of the complications of more general set theory do not arise.

Sets of numbers

The symbol \(\mathbb{Z}\) is conventionally used for the set of all integers (i.e., whole numbers). The symbol \(\mathbb{N}\) is used for the set of all counting numbers (i.e., non-negative whole numbers). The symbol \(\mathbb{R}\) is conventionally used for the set of all real numbers. The symbol \(\mathbb{C}\) is used for the set of all complex numbers.

A.4 Finite and infinite sets

A set \(s\) is said to be finite if it can be written in the form \(s = \{ x_0, x_1, \ldots, x_{n-1}\}\) for some \(n \geq 0.\) If a set is not finite, it is said to be infinite.

The size of a finite set is the number of distinct members it has. For example, the set \(s = \{ 1, 2, 3 \}\) has size 3. It is standard to use absolute-value notation for the size of a set, for example, \[ \left| \{ 1, 2, 3 \} \right| = 3. \] In general, every finite set \(s = \{ x_0, \ldots, x_{n-1} \}\) has size \(n\). The empty set \(\emptyset\) has size 0.

A.5 Comprehensions and power sets

Comprehensions are a way to turn conditions into sets. For example, I might take all the natural numbers that satisfy the condition of being odd or being less than 17 or both. We use the standard comprehension notation, \[ B = \{ x \in A \, : \, \phi(x) \}, \] for the set \(B\) that contains all of the elements of \(A\) that satisfy the condition \(\phi\).

The set of all subsets of a set \(A\) is called its power set and is defined by \[ \textrm{pow}(A) = \{ B : B \subseteq A \}. \] Another common notation for power sets is \(\mathbf{2}^A.\)

A.6 Union, intersection, difference, and complement

The union of two sets \(A\) and \(B\), written \(A \cup B\), contains the elements that are in either set, \[ A \cup B = \{ x \, : \, x \in A \textrm{ or } x \in B \}. \] The intersection of two sets \(A\) and \(B\), written \(A \cap B\), contains the elements that are in either set, \[ A \cap B = \{ x \, : \, x \in A \textrm{ and } x \in B \}. \] Logically speaking, union is disjunction (or), whereas intersection is conjunction (and).

Both unions and intersections can be extended to infinite sets using subscript notation. If \(A_0, A_1, \ldots\) is an infinite sequence of sets, then \[ \bigcup_{n = 0}^{\infty} A_n = \{ x \, : \, x \in A_n \textrm{ for some } n < \infty \} \] and \[ \bigcap_{n = 0}^{\infty} A_n = \{ x \, : \, x \in A_n \textrm{ for all } n < \infty \} \] In the infinite case, union parallels existential quantification (for some), whereas intersection parallels universal quantification (for every).

If \(A\) and \(B\) are sets, then their difference, written \(A \setminus B\), is the set of elements in \(A\) that are not in \(B\), \[ A \setminus B = \{ x \in A \, : \, x \not\in B \}. \]

If there is a universal set \(\mathcal{U}\), the complement of the set \(A\) is the set of elements not in \(A\), \[ A^\complement = \mathcal{U} \setminus A = \{ x \in \mathcal{U} \, : \, x \not\in A \}. \] The complement \(A^\complement\) is sometimes written as \(\bar{A}\).

If we have a universal set \(\mathcal{U}\), then we can use it to define difference in terms of complementation through \[ A \setminus B = A \cap B^\complement. \]

A.7 Disjointness and partitions

A pair of sets \(A\) and \(B\) are disjoint if \(A \cap B = \emptyset\). A potentially infinite sequence of sets \(A_0, A_1, \ldots\) is pairwise disjoint if for every \(i, j < \infty\), \(A_i\) and \(A_j\) are disjoint, \(A_i \cap A_j = \emptyset\).

A pairwise disjoint sequence of sets \(B_0, B_1, \ldots\) is said to be a partition of a set \(A\) if \[ A = \bigcup_{n < \infty} B_n. \]

A.8 Basic laws of set theory

Commutativity

Set operations are commutative, meaning that for all sets \(A\) and \(B\), \[ A \cup B = B \cup A \] and \[ A \cap B = B \cap A. \]

Associativity

Set operations are associative, meaning that for all sets \(A\), \(B\), and \(C\), \[ (A \cup B) \cup C = A \cup (B \cup C) \] and \[ (A \cap B) \cap C = A \cap (B \cap C). \]

Distributivity

The intersection operation is distributive with respect to the union operation, so that for all sets \(A\), \(B\), and \(C\), \[ A \cap (B \cup C) = (A \cap B) \cup (A \cap C). \] Because of commutativity, this also holds on the right for \((B \cup C) \cap A\).

Involutivity

The complementation operation is an involution (i.e., equal to its own inverse), so that for all sets \(A\), \[ \left(A^\complement\right)^{\complement} = A. \]

Identities

The empty set is a left and right identity for union, so that for all sets \(A\), \[ \emptyset \cup A \ = \ A \cup \emptyset \ = \ A. \] If there is a universal set \(\mathcal{U}\), it is an identity for intersection, so that for all sets \(A\), \[ \mathcal{U} \cap A \ = \ A \cap \mathcal{U} \ = \ A. \]

Zeroes

The empty set is a zero for intersection, so that for all sets \(A\), \[ \emptyset \cap A \ = \ A \cap \emptyset \ = \ \emptyset. \] The universal set, if it exists, is a zero for union, so that for all sets \(A\), \[ \mathcal{U} \cup A \ = \ A \cup \mathcal{U} \ = \ \mathcal{U}. \]

Idempotence

The union and intersection operations are idempotent, so that for all sets \(A\), \[ A \cup A = A \] and \[ A \cap A = A. \]

DeMorgan’s laws

DeMorgan’s laws relate complementation, intersection, and union, where for all sets \(A\), \(B\), and \(C\), \[ \left(A \cup B\right)^\complement = A^\complement \cap B^\complement. \] and \[ \left(A \cap B\right)^\complement = A^\complement \cup B^\complement. \]

A.9 Tuples, cross-products, and relations

Tuples

Given elements from two sets, \(a \in A\) and \(b \in B\), we write \((a, b)\) for the tuple consisting of the pair \(a\) and \(b\) in that order. The identity conditions for tuples are that \((a, b) = (a', b')\) if and only if \(a = a\) and \(b = b.\) That means that if \(a \neq b,\) then \((a, b) \neq (b, a).\)

Cross products and exponents

The cross product of two sets \(A\) and \(B\) is written \(A \times B\) and defined to be the set of tuples formed from \(A\) and \(B,\) \[ A \times B = \{ (a, b) : a \in A, b \in B \}. \] This notion can be extended to any number of sets by taking the operator to be left associative and setting, \[ A_1 \times A_2 \times \cdots A_N = A_1 \times (A_2 \times \cdots A_N). \] Despite the nesting, we write \((a, b, c)\) rather than \((a, (b, c)).\)

For the cross product of a set \(A\) with itself \(n \in N\) times, we define the set exponent as \[ A^n = \underbrace{A \times \cdot \times A}_{n \textrm{ times}}. \]

Relations

A relation between the sets \(A\) and \(B\) is a set \(R \subseteq A \times B.\) For example, the comparison operator \(\leq\) defines a relation over the real numbers \(L = \{ (x, y) \in \mathbb{R} \times \mathbb{R} : x \leq y \}.\)

A relation \(R\) is said to be symmetric if \((a, b) \in R\) implies \((b, a) \in R\). The inequality relation is not symmetric, but the equality relation \(E = \{ (x, x) \in \mathbb{R} \times \mathbb{R} : x \in \mathbb{R} \}\) is.

A relation \(R\) is said to be transitive if \((a, b) \in R\) and \((b, c) \in R\) implies \((a, c) \in R.\) Both equality and inequality are transitive relations, but the within unit distance relation \(W = \{ (x, y) \in \mathbb{R} \times \mathbb{R} : | x - y | \leq 1 \}\) is not.

If \(R\) and \(T\) are relations, then their composition \(R \circ T\) is defined by \[ R \circ T = \{ (a, c) : (a, b) \in R, (b, c) \in R \}. \] Thus a relation \(R\) is transitive if \(R \circ R = R.\)

A.10 Functions

Functions

A relation \(f \subseteq A \times B\) is said to be a function if for every \(a \in A\) there is exactly one \(b \in B\) such that \((a, b) \in f.\) If \((a, b) \in f\) we write \(f(a) = b\) and say \(a\) maps to \(b.\) If \(f \subseteq A \times B\) is a function, we write \(f:A \rightarrow B\).

Domains and ranges

If \(f:A \rightarrow B\) is a function, it domain is \(A\) and its codomain is \(B.\) The range of a function is the set of values that result from applying the function to elements of \(A.\) If we define \(f:\mathbb{R} \rightarrow \mathbb{R}\) by \(f(x) = x^2,\) then the codomain is \(\mathbb{R},\) but the range is \((0, \infty).\) The range of a function is always a subset of its codomain.

Given a function, its domain is \[ \textrm{dom}(f) = \{ x : (x, y) \in f \} = A, \] and its range is \[ \textrm{rng}(f) = \{ y : (x, y) \in f \}, \] but we cannot infer its codomain from its members. The codomain only matters in a higher-level context in which the function is considered a member of a larger family of functions.

Given two sets \(A\) and \(B,\) the set of functions from \(A\) to \(B\) is written as \(B^A,\) and defined by \[ \textrm{B^A} = \{ f : f:A \rightarrow B \textrm{ is a function} \}. \]

Composition

If \(f:A \rightarrow B\) and \(g:B \rightarrow C\) are functions, then their composition \(g \circ f:A \rightarrow C\) is also a function, and satisfies \((g \circ f)(a) = g(f(a)).\)

Injective, surjective, and bijective functions

A function \(f : A \rightarrow B\) is said to be

  • injective if \(a \neq b\) implies \(f(a) \neq f(b),\)
  • surjective if \(B = \textrm{range}(B),\) and
  • bijective if it is both injective and surjective.

An injective function is called an injection, a surjective one a surjection, and a bijective one a bijection. Surjections are also said to be “onto” and bijections are said to be “one to one.” A function \(f:A \rightarrow B\) is said to be many to one if there exists \(a \neq a'\) such that \(f(a) = f(a').\)

Inverse functions

If \(f:A \rightarrow B\) is a bijection, it has an inverse \(f^{-}:B \rightarrow A\) defined so that \(f^{-1}(b) = a\) if \(f(a) = b.\) The composition of a function with its inverse or vice-versa is the identity function, \(f \circ f^{-} = \textrm{id},\) \(f^{-} \circ f = \textrm{id}.\) Thus for a bijection \(f:A \rightarrow B,\) we have \(f^{-1}(f(a)) = a\) for all \(a \in A\) and \(f(f^{-}(b)) = b\) for all \(b \in B.\)

Special functions

For a given set \(A\), the *identity function$ \(\textrm{id}_A:A \rightarrow A\) maps each element to itself, i.e., \(\textrm{id}_A(a) = a.\). We usually drop the subscript and just write \(\textrm{id}.\)

For any set \(A\), the indicator function \(\textrm{I}_A\) is defined so that $$ _A(x) = \[\begin{cases} 1 & \textrm{if } x \in A, \textrm{ and} \\[2pt] 0 & \textrm{otherwise}. \end{cases}\]

14.7 A.11 Finite, countable, and uncountable sets

A set \(A\) is finite if there exists a number \(n \in \mathbb{N}\) and a bijection from \(\{ 0, 1, \ldots, n \}\) and \(A.\) The number \(n\) is the size of the set. A set is infinite if it is not finite.

A set \(A\) is countable if there is surjective function \(f: \mathbb{N} \rightarrow A\) and is said to be uncountable otherwise. A countable set may be either finite or infinite. It can be shown that the integers \(\mathbb{Z}\) are countable and the set of real numbers \(\mathbb{R}\) is uncountable.

A. Probability theory

The goal of this overview is to provide precise definitions of those aspects of probability theory required to understand the operations used in applied Bayesian statistics. The presentation was heavily influenced by Anderson and Moore (1979)’s book Optimal Filtering, (Appendix A, “Brief Review of Results of Probability Theory”).

A.1 Probability spaces

Sample spaces and \(\sigma\)-algebras

A sample space is a non-empty set \(\Omega\), the members of which are called outcomes because they represent possible ways the world might be (i.e., what are sometimes known as “possible worlds”).

A \(\sigma\)-algebra \(\mathcal{S}\) is a set of subsets of \(\Omega\), \(\mathcal{S} \subseteq \textrm{pow}(\Omega)\), that satisfy three properties,

  • space is measurable: \(\Omega \in \mathcal{S}\),
  • closed under complements: if \(A \in \mathcal{S},\) then \(A^{\complement} \in \mathcal{S}\), and
  • closed under countable unions: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N},\) then \(\bigcup_{n=0}^N A_n \in \mathcal{S}\).

Because \(\Omega \in \mathcal{S}\), its complement \(\emptyset = \emptyset^{\complement} \in \mathcal{S}\). Closure under intersections follows from De Morgan’s law, \[ A \cap B = \left( A^{\complement} \cup B^{\complement} \right)^\complement, \] which extends to the countably infinite case \[ \bigcap_{n \in \mathbb{N}} A_n = \left( A_n^\complement \right)^{\complement}. \]

For applied statistics, we can restrict our attention to sample spaces \(\Omega \subseteq \mathbb{R}^N\).

Example 1(a): The power set of any non-empty set contains the empty set and is closed under complementation and countable unions, and is hence a \(\sigma\)-algebra. The trivial case has a singleton sample space \(\Omega = \{ 0 \}\), with \(\sigma\)-algebra \(\mathcal{S} = \left\{ \{\}, \{ 0 \} \right\}\). The simplest non-trivial example has a binary sample space \(\Omega = \{ 0, 1 \}\), with \(\sigma\)-algebra \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\).

*Example 1(b): An example of a countably infinite sample space, we can take the power set of the natural numbers, \(\Omega = \textrm{pow}(\mathbb{N}),\) which is not only a \(\sigma\)-algebra, but a complete, atomic, boolean algebra.

Example 2(a): A simple continuous example takes a sample space \(\Omega = (0, 1)\) and defines \(\mathcal{S}\) to be the smallest \(\sigma\)-algebra containing the open intervals \((a, b) \subseteq \Omega\) (i.e., \(0 < a < b < 1\)) and closed under complements and countable unions. The smallest \(\sigma\)-algebra containing the open sets of a sample space is called a Borel algebra and can be shown to exist. This \(\sigma\)-algebra contains all of the singletons, because \(\{ a \} = ((0, a) \cup (a, 1))^{\complement}.\)

Example 2(b): Another example of a Borel algebra is the closure of the open intervals over the entire real line, \(\Omega = \mathbb{R}\).

Example 3: If \(\mathcal{S}_1\) and \(\mathcal{S}_2\) are \(\sigma\)-algebras over \(\Omega_1\) and \(\Omega_2\), then the product space \(\mathcal{S} = \left\{ A_1 \times A_2 : A_1 \in \mathcal{S}_1 \ \textrm{and} \ A_2 \in \mathcal{S}_2 \right\}\) is a \(\sigma\)-algebra over \(\Omega = \Omega_1 \times \Omega_2.\)

Measures and probability spaces

Probability measures are measures where the sample space is normalized to have unit measure. A probability measure over a \(\sigma\)-algebra \(\mathcal{S}\) is a function \(\Pr: \mathcal{S} \rightarrow [0, 1]\) such that for all \(A, B_n \in \mathcal{S},\)

  • sample space unit probability: \(\Pr[\Omega] = 1\),
  • non-negativity: if \(A \in \mathcal{S}\), then \(\Pr[A] \geq 0\), and
  • countable additivity: if \(A_n \in \mathcal{S}\) for \(n \in \mathbb{N}\) is a sequence of pairwise disjoint sets (i.e., \(A_i \cap A_j = \emptyset\) if \(i \neq j\)), then \[ \Pr\!\left[\bigcup_{n = 0}^\infty B_n\right] = \sum_{n=0}^\infty \Pr\!\left[B_n\right]. \]

The sample space \(\Omega\) with a probability measure \(\Pr\) over the \(\sigma\)-algebra \(\mathcal{S}\) together make up the probability space \((\Omega, \mathcal{S}, \Pr)\).

In analysis, an element \(A \in \mathcal{S}\) is said to be a measurable set. Measurable sets in a probability measure are called events. Because \(A^\complement \cup A = \Omega\), it follows from countable additivity that \[ \Pr[A^{\complement}] = 1 - \Pr[A], \] and in particular, \(\Pr[\emptyset] = \Pr[\Omega^\complement] = 1.\) We also have \(\Pr[A] \leq 1\) for any event \(A \in \mathcal{S}.\)

Example 1(a), continued: We can define a measure over the \(\sigma\)-algebra with sample space \(\Omega = \{ 0, 1 \}\) and measurable sets \(\mathcal{S} = \left\{ \{\}, \{ 0 \}, \{ 1 \}, \{ 0, 1 \} \right\}\). This involves assigning probabilities to all of the events. By construction, we know that \[ \Pr[\{ \}] = 0 \qquad \Pr[\{ 0, 1 \}] = 1. \] By the countable additive property of measures, and the fact that events \(\{ 0 \}\) and \(\{ 1 \}\) are disjoint, it follows that \[ \Pr[\{ 0 \}] + \Pr[\{ 1 \}] = \Pr[\{ 0 \} \cup \{ 1 \}] = 1, \] and therefore \[ \Pr[\{ 0 \}] = 1 - \Pr[\{ 1 \}]. \] For example, let \(\Pr[\{ 1 \}] = 0.7\) so that \(\Pr[\{ 0 \}] = 0.3.\)

If we take the sample space to represent the boolean outcome of a football match (1 for the home team winning, 0 for the visiting team winning), then \(\Pr[\{ 1 \}]\) is the probability that the home team wins and \(\Pr[\{ 0 \}]\) is the probability that the visiting team wins.

Example 1(b), cont. We can define multiple measures over the power set of the integers. These are defined by supplying probabilities to the atoms, i.e., the integers. In this example, we will take \(\Pr[\{ n \}] = 2^-{n + 1}.\) This assigns a probability to the event \(\{ n \}\) equal to that of flipping \(n\) successive heads with a fair coin. Countable additivity gives us \(\Pr[\mathbb{N}] = 1,\) because \(\sum_{n \in \mathbb{N}} 2^{n - 1} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = 1.\)

from which it follows from countable additivity that for any \(A \subseteq \mathbb{N},\) that \(\Pr[ A \] = \sum{a \in A} \Pr[\{ a \}].\)

Example 2(a), cont. We can define the Lebesgue measure over intervals \((a, b) \subseteq (0, 1)\) by \[ \Pr[(a, b)] = b - a. \] This yields the expected value for the whole space, \[ \Pr[\Omega] = \Pr[(0, 1)] = 1 - 0 = 1. \] The probability for unions of non-intersecting intervals is defined by countable additivity. Singleton sets have measure zero, because \[ \Pr[\{ a \}] = \Pr[(a, a)] = a - a = 0. \]
We could have used \(\Omega = [0, 1]\) as an example, but it is different geometrically than \((0, 1)\) because it does not contain its limits at the end points; the open interval will be more convenient for this example going forward.

Example 2(b), cont. The Lebesgue measure over \(\Omega = \mathbb{R}\) does not form a probability space, because the measure of the whole space \(\mathbb{R}\) is not finite. On the other hand, we can define a measure over \(\Omega = \mathbb{R}\) by defining the probability of an interval to be \[ \Pr[(a, b)] = \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a), \] where for \(u \in (0, 1)\) and \(v \in \mathbb{R},\) \[ \textrm{logit}(u) = \log \frac{u}{1 - u} \qquad \qquad \textrm{logit}^{-1}(v) = \frac{\exp(v)}{1 + \exp(v)}. \] This has the desired property that \[ \Pr[\Omega] = \lim_{\begin{array}{c}a \rightarrow -\infty \\ b \rightarrow \infty\end{array}} \Pr[(a, b)] = \lim_{\begin{array}{c}a \rightarrow -\infty \\ b \rightarrow \infty\end{array}} \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a) = 1 - 0 = 1. \]

Example 3, cont. Suppose we have a product \(\sigma\)-algebra over the open unit square \(\Omega = [0, 1]^2\) (i.e., \([0, 1] \times [0, 1]\)). We can define a basis for a probability measure using open discs for \(x \in (0, 1)^2\), \[ \textrm{D}(x, r) = \left\{ x' : ||x|| < r \right\}. \] The probability of a disc in the unit square (i.e., \(D(x, r) \subset \Omega\)) is given by its area, \[ \Pr[\textrm{D}(x, r)] = \pi \cdot r^2. \] Probabilities for other measurable sets are defined by countable additivity and complementation. In this probability space, a measurable subset of \((0, 1)^2\) has a probability equal to its area (i.e., it’s a Lebesgue measure like Example 2(a)). Thus the entire sample space \(\Omega\) has a probability of 1 because it is a unit square with unit area. This construction works with the volumes of balls in three dimensions and the hypervolumes of hyperballs in higher dimensions.

A.2 Joint and conditional probability

In this section, we will presuppose a fixed probability space \((\Omega, \mathcal{S}, \Pr)\).

Joint probability

If \(A\) and \(B\) are events (i.e., \(A, B \in \mathcal{S}\)), their joint probability is written \(\Pr[A, B]\) and understood as the probability that events \(A\) and \(B\) both occur, so that \[ \Pr[A, B] = \Pr[A \cap B]. \] This notation extends to any number of events, with \(\Pr[A, B, C] = \Pr[A \cap B \cap C]\) and so on.

The sets \(A_1, A_2, \ldots\) are said to be mutually disjoint if \(i \neq j\) implies \(A_i \cap A_j = \emptyset.\) A \(partition\) of the set \(B\) is a countable sequence of mutually disjoint sets \(B_0, B_1, B_2, \ldots\) such that \(B = \bigcup_{n \in \mathbb{N}} \, B_n.\)

The law of total probability states that if \(B_0, B_1, \ldots\) is a partition of \(B\), then \[ \Pr[A, B] = \sum_{n \in \mathbb{N}} \, \Pr[A, B_n]. \] This result follows directly from countable additivity.

In the context of joint probability \(\Pr[A, B]\), we refer to \(\Pr[A]\) as the marginal probability (of \(A\)). It follows from the law of total probability that if \(B_1, B_2, \ldots\) is a partition of \(\Omega,\) then the marginal probability of \(A\) is \[ \Pr[A] = \Pr[A, \Omega] = \sum_n \Pr[A, B_n]. \]

Conditional probability

If \(A\) and \(B\) are events and \(\Pr[B] \neq 0\), the conditional probability of \(A\) given \(B\) is the probability of \(A\) occurring if \(B\) occurs, and is defined by \[ \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]}. \]

The chain rule for probability follows immediately, \[ \Pr[A, B] = \Pr[A \mid B] \cdot \Pr[B]. \]

Given a fixed event \(B \in \mathcal{S},\) the function \(\Pr[A \mid B]\) forms a measure over \(B.\) More formally, we can derive a new probability space with \(\Omega' = B\), \(\mathcal{S} = \{ A \in \mathcal{S} : A \subseteq B \},\) and \(\Pr'[A] = \Pr[A \mid B].\)

Bayes’s rule

Bayes’s rule allows us to derive a conditional probability \(\Pr[A \mid B]\) given the inverse probability \(\Pr[B \mid A]\) and \(\Pr[A]\). In its simplest form, it’s just an application of the chain rule and definition of conditional probability, where if \(\Pr[B] \neq 0,\) \[ \Pr[A \mid B] = \frac{\Pr[B \mid A] \cdot \Pr[A]}{\Pr[B]}. \] But Bayes’s rule goes further and assumes that if \(A_0, A_1, A_2, \cdots\) is a partition of \(A\) (either finite or countably infinite), then \[ \Pr[A_n \mid B] = \frac{\Pr[B \mid A_n] \cdot \Pr[A_n]} {\sum_{n'} \Pr[B \mid A_{n'}] \cdot \Pr[A_{n'}]}. \]

A.3 Independent events

There are several notions of independence in statistics, which we summarize below.

Independence

A pair of events \(A\) and \(B\) are said to be independent if \[ \Pr[A, B] = \Pr[A] \cdot \Pr[B]. \] When \(A\) and \(B\) are independent, the occurence of \(A\) doesn’t give us any information about \(B\), and vice-versa, so that \[ \Pr[A \mid B] = \Pr[A] \qquad \qquad \Pr[B \mid A] = \Pr[B]. \] If \(A\) and \(B\) are independent, then \(A\) and \(B^\complement\) are also independent, and thus \(\Pr[A \mid B] = \Pr[A \mid B^\complement].\)

Example 3, cont. Consider our Lebesgue measure on \((0, 1)^2,\) where the probability of an event is equal to its area. We can define an event \(A\) by dividing the sample space into an area for \(A\) of 0.2 and an area for \(A^\complement\) of 0.8. In the following diagram, we represent this as a blue horizontal dividing line, with the lower half (with area 0.2) representing the event \(A.\) We do the same for \(B\) with a red vertical dividing line, taking the left half (with area 0.6) to correspond to the event \(B.\)

Code
mydraw(pn.ggplot(pd.DataFrame())
  + pn.scale_y_continuous(breaks=[0.1, 0.6], labels=["A", "-A"], limits = (0, 1), expand=(0, 0))
  + pn.scale_x_continuous(breaks=[0.3, 0.8], labels=["B", "-B"], limits = (0, 1), expand=(0, 0))
  + pn.geom_vline(pn.aes(xintercept=0.6), color='red')
  + pn.geom_hline(pn.aes(yintercept=0.2), color='blue')
  + pn.coord_equal()
  + pn.theme(panel_grid_major = pn.element_blank(),
             panel_grid_minor = pn.element_blank(),
             axis_title = pn.element_blank(),
         axis_ticks = pn.element_blank())
)

We have \(\Pr[A] = 0.2\) and \(\Pr[B] = 0.6.\) We can work out that \[ \Pr[A \mid B] = \frac{\Pr[A, B]}{\Pr[B]} = \frac{0.12}{0.6} = 0.2 = \Pr[A], \] and \[ \Pr[B \mid A] = \frac{\Pr[B, A]}{\Pr[A]} = .\frac{0.12}{0.2} = 0.6 = \Pr[B]. \] Thus the events \(A\) and \(B\) are independent. We can also reason geometrically, noting that the ratio of \(A\)’s area to the total space’s area is the same as the ratio of \(A \cap B\)’s area to the area of \(B.\) Now consider the following manner of partitioning the sample space.

Code
mydraw(pn.ggplot(pd.DataFrame())
  + pn.scale_x_continuous(breaks=[0.3, 0.8], labels=["C", "-C"], limits = (0, 1), expand=(0, 0))
  + pn.scale_y_continuous(breaks=[0.1, 0.6], labels=["A", "-A"], limits = (0, 1), expand=(0, 0))
  + pn.geom_segment(pn.aes(x=0.2, y=1, xend=0.7, yend=0), color='red')
  + pn.coord_equal()
  + pn.geom_hline(pn.aes(yintercept=0.2), color='blue')
  + pn.theme(panel_grid_major = pn.element_blank(),
             panel_grid_minor = pn.element_blank(),
         axis_title = pn.element_blank(),
         axis_ticks = pn.element_blank())
)

Geometric reasoning here shows that \(\Pr[A \mid C^\complement] < \Pr[A \mid C]\) and thus \(A\) and \(C\) are not independent.

Conditional independence

A pair of events \(A\) and \(B\) are said to be conditionally independent given an event \(C\) with \(\Pr[C] > 0\) if \[ \Pr[A, B \mid C] = \Pr[A \mid C] \cdot \Pr[B \mid C]. \] As with independence, if \(A\) and \(B\) are conditionally independent given \(C\), then \(A\) doesn’t provide any information about the occurrence of \(B\) if \(C\) occurs, \[ \Pr[A \mid B, C] = \Pr[A \mid C] \qquad \qquad \Pr[B \mid A, C] = \Pr[B \mid C]. \]

Mutual independence

A sequence of events \(A_0, A_1, \ldots, A_N\) is said to be mutually independent if

  • \(\Pr\!\left[\bigcap_{n = 1}^N A_n\right] = \prod_{n=1}^N \Pr[A_n],\) and

  • every proper subsequence \(A_m, A_{m+1}, \ldots, A_{M}\) is mutually independent, where a proper subsequence is one in which \(m \neq 0\) or \(M \neq N\) (i.e., it is not the whole sequence).

Conditional mutual independence is defined the same way.

A.4 Random variables

In this section we introduce univariate and multivariate random variables and show how to define new random variables as functions of random variables.

Random variables

Given a probability space \((\Omega, \mathcal{S}, \Pr)\), a random variable over this space is a function \(X:\Omega \rightarrow \mathbb{R}\) such that for every \(x \in \mathbb{R}\), the set \(\{ \omega : X(\omega) \leq x \} \in \mathcal{S}\) (i.e., it is is measurable). A random variable is a deterministic function, and as such is neither a variable nor random by itself. The name derives because we often use simple variables like \(X\) to refer to random variables in notation and because it derives its random structure from the underlying probability measure \(\Pr,\) as we show in the rest of this section.

Given a random variable \(X\), we define its range as usual for a function, \[ \textrm{range}(X) = \{ X(\omega) : \omega \in \Omega \}. \] Because \(X(\omega) \in \mathbb{R}\), the range is always a subset of real numbers. It is conventional to overload the membership operator for random variables and write \(X \in A\) if \(X(\omega) \in A\) for every \(\omega \in \Omega.\)

Example 1, cont. Our first example has a finite sample space, \(\Omega = \{ 0, 1 \}.\) Over this sample space, we can define a random variable \(X:\Omega \rightarrow \mathbb{R}\) by \(X(0) = -2\) and \(X(1) = 10\). This has \(\textrm{range}(X) = \{ -2, 10 \}\) and represents the payoff of a 2 unit bet placed on the outcome \(X = 1\) at 5:1 odds.

Example 2(a), cont. In this ongoing example, we have \(\Omega = (0, 1)\). Because \(\Omega \subseteq \mathbb{R}\), we can define a random variable \(X\) by the identity function, \[ X(\omega) = \omega, \] with a range equal to the sample space, \(\textrm{range}(X) = (0, 1) = \Omega.\) The range of a random variable is not constrained by \(\Omega\). Consider the random variable \(Y\) over \(\Omega = (0, 1)\) defined by \[ Y(\omega) = \textrm{logit}(\omega) & = \log \frac{\omega}{1 - \omega} \textrm{ if} \omega \in (0, 1). \] A third example is the random variable \(Z\) defined by \(Z(\omega) = \log \omega\), which has \(\textrm{range}(Z) = (-\infty, 0)\). We can even have a random variable \(T\) over \(\Omega = (0, 1)\) with a discrete range by defining \[ T(\omega) = \begin{cases} 1 & \textrm{if } \omega < 0.3, \textrm{ and} \\ 0 & \textrm{if } \omega \geq 0.3, \end{cases} \] so that \(\textrm{range}(T) = \{ 0, 1 \}.\)

Example 2(b), cont. We previously defined a probability space over the sample space \(\Omega = \mathbb{R}\). We can define a random variable \(U\) over \(\Omega\) by the identity function, \[ U(\omega) = \omega, \] which gives us \(\textrm{range}(U) = \mathbb{R}.\) We can define a second random variable \(V\) by \[ V(\omega) = \textrm{logit}^{-1}(\omega) = \frac{1}{1 + \exp(-\omega)}, \] which has \(\textrm{range}(V) = (0, 1)\). A third example is \(W\) defined by \(W(\omega) = \omega^2\), which has \(\textrm{range}(W) = [0, \infty)\).

Independence for random variables

Two random variables \(X\) and \(Y\) are said to be independent if the events picked out by \(X \leq x\) and \(Y \leq y\) (i.e., \(\{ \omega \in \Omega : X(\omega) \leq x \}\)) are independent for all \(x\) and \(y.\)

Functions of random variables

Random variables partly derive their name because we can use them in arithmetic operations and as arguments to functions to produce new random variables. Given a random variable \(X\) defined over a sample space \(\Omega\) and a function a function \(f:\textrm{range}(X) \rightarrow \mathbb{R}\), we can define a new random variable \(Y\) over the same sample space by setting \[ Y(\omega) = f(X(\omega)). \] This new random variable is conventionally written as \(Y = f(X),\) which is yet another overload of conventional set-theory notation.

Functions of more than one random variable work the same way. If \(X_1, \ldots, X_N\) are random variables and \(f:\textrm{range}(X_1) \times \cdots \times \textrm{range}(X_N) \rightarrow \mathbb{R}\), then we write \(Y = f(X_1, \ldots, X_N)\) for the random variable defined by \[ Y(\omega) = f(X_1(\omega), \ldots, X_N(\omega)). \]

Example 2(a), cont. Although we have given the random variables \(X,\) \(Y,\) and \(T\) direct definitions, note that \(Y = \textrm{logit}(X),\) \(Z = \textrm{log}(X),\) and \(T = \textrm{I}(X < 0.3).\)

Example 2(b), cont. In this example, we have defined \(U,\) \(V,\) and \(W\) as primitive random variables, but we also have \(V = \textrm{logit}^{-1}(U)\) and \(W = \textrm{U}^2.\)

A.5 Cdfs, pdfs, and pmfs

In this section, we introduce cumulative distribution functions (cdfs), probability density functions (pdfs), and probability mass functions (pmfs).

Cumulative distribution function

Given a random variable \(X\) defined over a probability space \((\Omega, \mathcal{S}, \Pr),\) its cumulative distribution function (cdf) \(F_X:\mathbb{R} \rightarrow [0, 1]\) is defined for \(x \in \mathbb{R}\) by \[ F_X(x) = \Pr[X \leq x], \] where we read \(X \leq x\) as the event \(\{ \omega \in \Omega : X(\omega) \leq x \}\). The cdf of a random variable is well-defined because all of the relevant events \(X \leq x\) must be measurable in order for \(X\) to be a random variable.

The cdf \(F_X\) of any random variable \(X\) satisfies three important properties,

  • \(F_X\) is continuous from the right,
  • \(F_X\) is monotonic increasing (but not necessarily strictly so), and
  • the limit as the argument goes to infinity is one, \[ \lim_{x \rightarrow \infty} F_X(x) = 1. \]

Example 1, cont.: Continuing our minimal discrete example with \(\Omega = \{ 0, 1 \}\), \(\Pr[\{ 0 \}] = 0.3,\) and \(\Pr[\{ 1 \}] = 0.7,\) consider a random variable that takes \(X(0) = -2\) and \(X(1) = 10\). The resulting cdf is \[ F_X(x) = \begin{cases} 0 & \textrm{if } x \in (-\infty, 2), \\ 0.3 & \textrm{if } x \in [-2, 10), \textrm{ and} \\ 1.0 & \textrm{if } x \in [10, \infty). \end{cases} \]

Example 2(a), cont. Taking up our example of \(\Omega = (0, 1),\) with \(\Pr[(a, b)] = b - a\) for \((a, b) \subseteq (0, 1),\) we see that the random variable \(X(\omega) = \omega\) has the cdf \(F_X:\mathbb{R} \rightarrow [0, 1]\) defined by \[ F_X(x) = \begin{cases} 0 & \textrm{if } x \in (-\infty, 0), \\ x & \textrm{if } x \in [0, 1), \textrm{ and} \\ 1 & \textrm{if } x \in [1, \infty). \end{cases} \] The result for \(x \in [0, 1)\) follows because \(F_X(x) = \Pr[X \leq x] = \Pr[(0, x)] = x.\)

For the case of \(Y(\omega) = \textrm{logit}(\omega),\) we have the cdf \(F_Y:\mathbb{R} \rightarrow [0, 1]\) defined by \[\begin{align} F_Y(y) &= \Pr[Y \leq y] \\[4pt] &= \Pr[\textrm{logit}^{-1}(Y) \leq \textrm{logit}^{-1}(y)] \\[4pt] &= \Pr[X \leq \textrm{logit}^{-1}(y)] \\[4pt] &= \textrm{logit}^{-1}(y). \end{align}\]

Example 2(b), cont. In this example, \(\Omega = \mathbb{R}\) and \(\Pr[(a, b)] = \textrm{logit}^{-1}(b) - \textrm{logit}^{-1}(a).\) Let the random variable \(U\) be the identity i.e., \(U(\omega) = \omega.\) The cdf of \(U\) is the function \(F_U:\mathbb{R} \rightarrow [0, 1]\) defined by \[\begin{align} F_U(u) &= \Pr[U \leq u] \\[3pt] &= \lim_{x \rightarrow -\infty} \, \Pr[U \in (x, u)] \\[3pt] &= \lim_{x \rightarrow -\infty} \, \textrm{logit}^{-1}(u) - \textrm{logit}^{-1}(x) \\[3pt] &= \textrm{logit}^{-1}(u) - \lim_{x \rightarrow -\infty} \, \textrm{logit}^{-1}(x) \\[3pt] &= \textrm{logit}^{-1}(u). \end{align}\] This is the same cumulative distribution function as we had for \(Y\) in Example 2(a). Consider a second random variable \(V = \textrm{logit}^{-1}(U)\). The cdf of \(V\) is given by \[ F_V(v) = \begin{cases} 0 & \textrm{if } v \in (-\infty, 0) \\ v & \textrm{if } v \in [0, 1) \\ 1 & \textrm{if } v \in [1, \infty). \end{cases} \] The middle case follows because \[\begin{align} F_V(v) &= \Pr[V \leq v] \\[4pt] &= \Pr[\textrm{logit}(V) \leq \textrm{logit}(v)] \\[4pt] &= \Pr[U \leq \textrm{logit}(v)] \\[4pt] &= \textrm{logit}^{-1}(\textrm{logit}(v)) \\[3pt] &= v. \end{align}\] Thus we see that \(F_V\) in this example has the same cumulative distribution as \(F_X\) in example 2(a).

Probability density function

If the cumulative distribution function \(F_X\) for a random variable \(X\) is continuous and differentiable, we define its probability density function (pdf) \(p_X:\mathbb{R} \rightarrow (0, \infty)\) by \[ p_X(x) = F_X'(x), \] for \(x \in \mathbb{R},\) where \(F_X'\) is the derivative of the cdf. Thus the unit of measure for the cdf is probability, whereas the unit of measure for the pdf is change in probability.

Even though the cdf forms the theoretical foundation for distributions, in practice we usually work with pdfs and occasionally want to go back to the cdf, which we can do by integration, \[ F_X(x) = \int_{-\infty}^x \, p_X(x) \, \textrm{d}x. \]

We know that \(\Pr[\Omega] = 1\) and that for any random variable \(X,\) \(\Pr[\Omega] = \lim_{x \rightarrow \infty} \, F_X(x) = 1\). Furthermore, we know \(F_X(x) = \int_{-\infty}^x p_X(x) \textrm{d}x.\) Putting this together we have

\[\begin{align} \int_{-\infty}^\infty p_X(x) \textrm{d}x &= \lim_{x \rightarrow \infty} \int_{-\infty}^x p_X(x) \textrm{d}x \\[4pt] &= \lim_{x \rightarrow \infty} F_X(x) \\[4pt] &= 1. \end{align}\]

Example 2(a), cont. We have a random variable \(X\) with cdf \(F_X(x) = x,\) for \(x \in X,\) so the pdf is \[ p_X(x) = F_X'(x) = \frac{\textrm{d}}{\textrm{d}x} x = 1. \] A constant density indicates the variable has a uniform distribution. The uniform distribution has an associated function for \(\alpha < \beta \in \mathbb{R},\) \[ \textrm{uniform}(x \mid \alpha, \beta) = \frac{1}{\beta - \alpha}. \] If a random variable \(X\) has a density \(p_X(x) = \textrm{uniform}(x \mid \alpha, \beta),\) we write \[ X \sim \textrm{uniform}(\alpha, \beta). \] In this notation, \(\alpha\) and \(\beta\) might themselves be random variables, which require the notion of conditional density, which we define below.

For our second random variable \(Y\), the cdf is \(F_Y(y) = \textrm{logit}^{-1}(y),\) and so the pdf is \[ p_Y(y) = F_Y'(y) = \frac{\textrm{d}}{\textrm{d}y} \textrm{logit}^{-1}(y) = \textrm{logit}^{-1}(y) \cdot (1 - \textrm{logit}^{-1}(y)). \] Variable \(Y\) has a standard logistic distribution, and we write \[ Y \sim \textrm{logistic}(0, 1). \] The general definition of the logistic distribution’s density involves parameters \(\mu \in \mathbb{R}\) and \(\sigma \in (0, \infty),\) \[ \textrm{logistic}(y \mid \mu, \sigma) = \textrm{logit}^{-1}\!\left(\frac{y - \mu}{\sigma}\right) \cdot \left(1 - \textrm{logit}^{-1}\!\left(\frac{y - \mu}{\sigma}\right)\right). \] The use of \(\mu\) and \(\sigma\) in this configuration defines what’s known as a location-scale parameterization. The standard parameterization takes the location to be zero and the scale to be one. If \(Y \sim \textrm{logistic}(0, 1)\) (i.e., it has a standard logistic distribution), then \(Z = \mu + \sigma \cdot Y\) is such that \(Z \sim \textrm{logistic}(\mu, \sigma).\) The parameter \(\sigma\) used to scale is called a scale parametrer and the parameter \(\mu\) used to translate or shift is called a location parameter.

The following code and plot shows the cdf and pdf for the standard logistic distribution.

Code
def logit(x):
    return np.log(x / (1 - x))
def inv_logit(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-8, 8, 100)
pdf_y = [inv_logit(x_n) * (1 - inv_logit(x_n)) for x_n in x]
df_pdf = pd.DataFrame({
    'x': x, 'y': pdf_y, 'type': 'pdf p_X'
})
cdf_y = [inv_logit(x_n) for x_n in x]
df_cdf = pd.DataFrame({
    'x': x, 'y': cdf_y, 'type': 'cdf F_X'
})
df = pd.concat([df_pdf, df_cdf])
mydraw(pn.ggplot(df, pn.aes(x='x', y='y')) + pn.geom_line()
    + pn.facet_wrap('~ type', scales = 'free_y')
    + pn.labs(x='x', y='value', title = 'X ~ logistic(0, 1)')
    + pn.scale_x_continuous(breaks = [-8, -6, -4, -2, 0, 2, 4, 6, 8])
    + pn.theme(figure_size=(10, 4), panel_spacing=0.5))

The side-by-side plots illustrate how the pdf \(p_X\) is the derivative of the cdf \(F_X.\) The derivative of the cdf, and hence the pdf, is maximized at \(X = 0,\) and approaches zero as \(X \rightarrow \infty\) or \(X \rightarrow -\infty.\) The chain rule for derivatives cancels the outer negation and establishes symmetry, \[ F_X(x) = 1 - F_X(-x) \qquad \qquad F_X'(x) = F_X'(-x) \qquad \qquad p_X(x) = p_X(-x). \]

Change of variables (univariate)

Suppose I have a univariate random variable \(X:\Omega \rightarrow \mathbb{R}\) and a function \(f:\mathbb{R} \rightarrow \mathbb{R}\). Then I can define a new random variable \(Y = f(X)\) by defining \[ Y(\omega) = f(X(\omega)) \] for a point \(\omega \in \Omega\) in the sample space.

If the function \(f\) is a differentiable, monotonic (up or down), and one-to-one, then the standard change of variables formula for densities applies, so that the density function for \(Y\) is \[ p_Y(y) = p_X(f^{-1}(y)) \cdot (f^{-1})'(y), \] where \(f^{-1}\) is the inverse of \(f\) and \((f^{-1})'\) is the derivative of the inverse of \(f\). We consider multivariate changes of variables when we introduce vector-based random variables below.

Probability mass function

If the cumulative distribution function \(F_X\) for a random variable \(X\) is discrete (in the sense of being made up only of step functions), then we define its probability mass function (pmf) \(p_X:\mathbb{N} \rightarrow (0, \infty)\) by \[ p_X(n) = \Pr[X \leq n + 1] - \Pr[X \leq n] = F_X(n + 1) - F_X(n), \] for \(n \in \mathbb{N}\).

The support of a random variable \(X\) is the set of values it can take on, which is formally defined as the values for which the pdf or pmf is strictly greater than 0, \[ \textrm{supp}(X) = \{ x : p_X(x) > 0 \}. \]

Example 4. Let \(U\) be the result of rolling a fair 6-sided die and \(V\) be the result of rolling another fair 6-sided die independently, and let \(X = U + V\) be the sum of the dice. The following plots show the pmf and pdf for \(X\), which can take on values between 2 and 12. There are \(6 \times 6 = 36\) different primitive outcomes, and each outcome has a probability equal to the total number of ways it can arise divided by 36 (e.g., \(\Pr[X = 2] = 1/36\) because \(X=2\) only occurs one way, with \(U = 1, V = 1,\) whereas \(\Pr[X = 5] = 4/36\) because \(X = 5\) can arise as four different combinations of \(U\) and \(V\).

Code
support = np.arange(2, 13)
probs = [0, 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36]
cdf_values = np.cumsum(probs)
df = pd.DataFrame({
    'n': support,  'cdf': cdf_values[1:12], 'cdf_low': cdf_values[0:11],
})
plot1 = (pn.ggplot(df, pn.aes(x='n', y='cdf'))
    + pn.geom_point(size=2)
    + pn.scale_x_continuous(breaks=np.arange(2, 13))
    + pn.labs(x='n', y='Pr[X <= n]')
    + pn.scale_x_continuous(limits=(1,13),
                            breaks=support, expand=(0, 0))
    + pn.ggtitle('cdf')
    + pn.theme(strip_margin_x=5.0))
for n in range(0,12):
    plot1 += pn.geom_segment(pn.aes(x = n + 1,
               y = cdf_values[n], xend = n + 2 - 0.075,
               yend = cdf_values[n]))
    plot1 += pn.geom_point(pn.aes(x='n', y='cdf_low'),
               size=2, stroke=0.5, fill='none',
           shape='o', color='black')
df2 = pd.DataFrame({ 'n': support, 'probs': probs[1:] })
plot2 = (pn.ggplot(df2, pn.aes(x='n', y='probs'))
         + pn.geom_bar(stat='identity', color='black', fill='white')
         + pn.scale_x_continuous(breaks=support)
         + pn.scale_y_continuous(breaks=np.arange(1, 7) / 36,
               labels=['1/36', '2/36', '3/36', '4/36', '5/36', '6/36'])
         + pn.ggtitle('pmf'))
g12 = (pw.load_ggplot(plot1, figsize=(3.25, 3))
      | pw.load_ggplot(plot2, figsize=(3.25, 3)))
g12.savefig()

The characterization of random variables with pdfs and pmfs is incomplete because not all probability distributions can be classified as discrete or continuous. A random variable \(X\) can have a cdf with both continuous portions and jumps. For example, if we define a mixture of a discrete and continuous cdf (e.g., for a spike-and-slab prior in Bayesian statistics), we get a cdf that is neither discrete nor continuous (we provide an example and plot its cdf below).

Conditional cdfs, pdfs, and pmfs

Conditional cumulative distribution function

If \(X:\Omega \rightarrow \mathbb{R}\) is a random variable and \(A \subseteq \Omega\) is an event with \(\Pr[A] > 0,\) the conditional cumulative distribution function is \[ F_{X \mid A}(x) = \Pr[X \leq x \mid A], \] where we read \(X \leq x\) as the event \(\{ \omega \in \Omega : X(\omega) \leq x \}.\)

Conditional probability mass functions

If \(X\) is a discrete random variable, the conditional probability mass function is \[ p_{X \mid A}(x) = \Pr[X = x \mid A], \] where \(X = x\) picks out the event \(\{ \omega \in \Omega : X(\omega) = x \}.\)

Suppose \(Y:\Omega \rightarrow \mathbb{R}\) is a discrete random variable (i.e., \(\textrm{range}(Y)\) is countable). It is conventional to further overload notation and terminology and define the conditional probability mass function \(p_{X \mid Y}\) by taking \[ p_{X \mid Y}(x \mid y) = \Pr[X = x \mid Y = y]. \] It follows that \[ p_{X \mid Y}(x, y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \]

Conditional probability density functions

If \(X\) is continuous, the conditional probability density function is defined by differentiation, \[ p_{X \mid A}(x) = F_{X \mid A}'(x). \]

If \(Y\) is also continuous, we will define a limit that provides the conditional form. Let \(Y^{y, \epsilon}\) be the event determined by the condition \(Y \in (y - |\epsilon|, y + |\epsilon|)\) and define \[ p_{X \mid Y}(x \mid y) = \lim_{\epsilon \rightarrow 0} \ p_{X \mid Y^{\mbox{$\scriptstyle y,\epsilon$}}}(x). \] It follows that \[ p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \]

Conditioning on discrete and continuous variables

In practice, we might have a variable (discrete or continuous) whose value depends on the value of one or more discrete variables and one or more continuous variables (e.g., the outcome of a drug trial based on a patient’s discrete age group and continuous blood pressure). In this case, we define the cdf by conditioning on equality for discrete variables and by taking limits of surrounding intervals (balls in higher dimensions) for continuous variables. If the resulting variable is discrete, its pmf is defined by differences of the cdf; if it is continuous, the pdf is defined by differentiating the cdf.

In all cases, we will rely on the fact that for any random variables \(X\) and \(Y\), including multivariate ones, we have \[ p_{X \mid Y}(x \mid y) = \frac{p_{X, Y}(x, y)}{p_Y(y)}. \] Here, we have exploited the overloaded \(p\) notation, which allows for discrete, continuous, or mixed random variables.

Example 5. Suppose we have a continuous random variable \(Y\) that depends on a discrete random variable \(U\) and a continuous random variable \(V.\) Then we define the conditional density as a partial derivative of a limit conditioned on the discrete outcome, \[ p_{Y \mid U, V}(y \mid u, v) = \frac{\partial}{\partial y} \lim_{\epsilon \rightarrow 0} F_{Y \mid U = u, \ V \in (v - |\epsilon|, v+ |\epsilon|)}(y). \]

Mixed continuous and discrete variables

Not all random variables can be classified as continuous or discrete—some have cumulative distribution functions that are continuous in some segments and step functions elsewhere. In our applied examples, such distributions will be constructed from simpler continuous and discrete distributions.

Example 6. Suppose we have a binary random variable \(Z\) (i.e., \(\textrm{range}(Z) = \{ 0, 1 \}\)) and a continuous random variable \(U\) with \(\textrm{range}(U) = \mathbb{R}.\) We can define a new random variable \(Y\) by \[ Y = \begin{cases} 0 & \textrm{if } Z = 1, \textrm{ and} \\[2pt] U & \textrm{if } Z = 0. \end{cases} \] The variable \(Y\) will have \(\textrm{range}(Y) = \mathbb{R},\) so it is not technically discrete. Nevertheless, we see that \(\Pr[Y = 0] = \Pr[Z = 1],\) which cannot happen in a continuous distribution (where all points have zero probability). The cdf \(F_Y\) is equal to the cdf \(F_U\) scaled by the probability \(Z = 0\), with a jump at zero of \(\Pr[Z = 1],\) \[ F_Y(y) = \begin{cases} F_U(y) \cdot \Pr[Z = 0] & \textrm{if } y < 0, \textrm{ and} \\[4pt] \Pr[Z = 1] + F_U(y) \cdot \Pr[Z = 0] & \textrm{if } y > 0. \end{cases} \] Zero-inflated continuous distributions are sometimes called spike and slab distributions, because the probability mass at zero acts as a “spike” (i.e., as a delta function for integration), with the underlying continuous distribution forming the “slab” (especially in when there is a high probability of extreme values).

Here is a plot of the resulting mixed cdf for the case \(\Pr[Z = 1] = 0.4\), with the standard logistic distribution as the continuous component (i.e., with cdf \(\textrm{logit}^{-1}.)\)

Code
p = 0.4
def mix_cdf(y):
    cdf_y = (1 - p) * inv_logit(y)
    return cdf_y if y < 0 else cdf_y + p
y_left = np.linspace(-8, -0.065, 50)
y_right = np.linspace(0, 8, 50)
F_Y_left = [mix_cdf(y_n) for y_n in y_left]
F_Y_right = [mix_cdf(y_n) for y_n in y_right]
df_left = pd.DataFrame({ 'y': y_left, 'F_Y': F_Y_left })
df_right = pd.DataFrame({ 'y': y_right, 'F_Y': F_Y_right })
mydraw(pn.ggplot(pn.aes(x = 'y', y = 'F_Y'))
    + pn.geom_line(data = df_left)
    + pn.geom_line(data = df_right)
    + pn.geom_point(data = pd.DataFrame({'x': [0], 'y': [(1 - p) / 2]}),
                    mapping = pn.aes(x = 'x', y = 'y'), size=2,
            shape='o', stroke = 0.5, fill='none')
    + pn.geom_point(data = pd.DataFrame({'x': [0], 'y': [1 - (1 - p) / 2]}),
                    mapping = pn.aes(x = 'x', y = 'y'), size=2)
    + pn.scale_x_continuous(breaks = np.linspace(-8, 8, 9))
    + pn.scale_y_continuous(breaks = [0, 0.3, 0.7, 1])
)

A.6 Multivariate random variables

Although we have only described univariate random variables that take values in \(\mathbb{R}\), if we take a collection of random variables \(X_1, \ldots, X_N\) we can combine them into a multivariate random variable \(X:\Omega \rightarrow \mathbb{R}^N\), by taking \[ X(\omega) = [X_1(\omega) \ \cdots \ X_N(\omega)]. \] In this case, we can write \(X = [X_1 \ \cdots \ X_N].\) The same construction may be naturally extended to matrices, tensors, etc.

Example 4. Consider a sample space \((0, 1)^2\) with a Borel \(\sigma\)-algebra and probability function given by area (i.e., the Lebesgue measure), so that if \(A \subseteq (0,1)^2,\) we have \(\Pr[A] = \int_{(0,1)^2} \, \textrm{I}(x \in A) \, \textrm{d}x.\) We can define a random variable \(X\) by \(X(\omega) = \pi_1(\omega),\) where \(\pi_1\) is the projection function defined by \(\pi_1\!\left((x,y)\strut\right) = x,\) and a second random variable \(Y\) defined by \(Y(\omega) = \pi_2(\omega),\) where \(\pi_2\!\left((x, y)\strut\right) = y.\) We can define a multivariate random variable \(Z = [X \quad Y],\) which could have also been defined directly by \(Z(\omega) = \omega.\)

Functions of multivariate random variables

As in the univariate case, we can directly apply real-valued multivariate functions to a sequence of random variables to define new random variables. For example, if \(f:\textrm{range}(X_1) \times \cdots \times \textrm{range}(X_N) \rightarrow \mathbb{R}^M,\) then we get a new random variable \(Y = f(X)\) defined by \[ Y(\omega) = f(X_1(\omega), \ldots, X_N(\omega)). \]

Example 4, cont. We can take the random variables \(X\) and \(Y\) and define a new random variable \(L = \sqrt{X^2 + Y^2}\) which is the Euclidean distance from the origin to the point \((X, Y)\). We can move back and forth between univariate and multivariate functions. For example, we could have defined \(L\) as a function of the multivariate variable \(Z\) by \(L = \sqrt{Z_1^2 + Z_2^2}.\)

Change of variables (multivariate)

Suppose we have a random variable \(X \in \mathbb{R}^N\) and a smooth, invertible function \(f:\mathbb{R}^N \rightarrow \mathbb{R}^N.\) Then we define \(Y = f(X)\) to be a new random variable, and note that its density is \[ p_Y(y) = p_X\!\left(f^{-1}(y)\right) \cdot \left| \nabla \! \left( f^{-1} \right) \! \left(y\right) \right|, \] where \(\nabla g\) is read as the Jacobian of the function \(g\) and \(\left| A \right|\) is read as the absolute determinant of the matrix \(A.\) Here the function \(g = f^{-1}\) maps \(\mathbb{R}^N\) to \(\mathbb{R}^N,\) so \(\nabla\!\left(f^{-1}\right)\) is an \(N \times N\) matrix.

Multivariate cdfs, pmfs, and pdfs

The cumulative distribution function for a multivariate random variable \(X:\Omega \rightarrow \mathbb{R}^N\) is defined by \[ F_X(x_1, \ldots, x_N) = \Pr[x_1 \leq X_1, \cdots, x_N \leq X_N] = \Pr[\{ x \in \mathbb{R}^N : x_1 \leq X_1, \ldots, x_N \leq X_N \}]. \]

The probability mass function for a discrete multivariate random variable is \[ p_X(x_1, \ldots, x_N) = \Pr[X_1 = x_1, \ldots, X_N = x_n]. \]

The probability density function for a continuous multivariate random variable \(X \in \mathbb{R}^N\) is defined as the function \(p_X:\mathbb{R}^N \rightarrow [0, \infty)\) that has the property that for any event \(A \in \mathcal{S},\) \[ \Pr[x \in A] = \int_A p_X(x) \textrm{d} x, \] or in terms of rectangular integrations and the cdf, \[ F_X(x) = \int_0^{x_1} \cdots \int_0^{x_N} p_X([x_1 \quad \cdots \quad x_N]) \, \textrm{d}x_N \cdots \textrm{d}x_1. \]

We can marginalize a joint cdf with a limit, \[ F_X(x) = \lim_{y \rightarrow \infty} F_{X, Y}(x, y). \] The law of total probability holds for joint densities, \[ p_X(x) = \int_{\mathbb R} \, p_{X, Y}(x, y) \, \textrm{d}y. \]

Conditional cdfs, pmfs, and pdfs are defined in the same way as for univariate random variables. This leads to a common form of marginalization where the joint density is factored, \[ p_X(x) = \int_{\mathbb R} \, p_{X \mid Y}(x \mid y) \cdot p_Y(y) \, \textrm{d}y. \]

A.7 Expectations

Expectation

If we have a (potentially multivariate) random variable \(Z\), we write \(\mathbb{E}[Z]\) for its expectation, which despite the fancy name, is just its average value, weighted by its density, \[ \mathbb{E}[Z] = \int_{\textrm{supp}(Z)} z \cdot p_Z(z) \, \textrm{d}z. \] Discussing random variables presupposes a background probability space \((\Omega, \mathcal{S}, \Pr)\) with respect to which the cumulative distribution function \(F_Z\) and the probability density function \(p_Z\) are defined.

Statistical notation overloads many notations, and commonly just \(Z\) rather than \(\textrm{supp}(Z)\) is used to subscript the integral.

Expectations are linear operators in that if \(a, b\) are constants and \(X\) is a random variable, then \[ \mathbb{E}[a + b \cdot X] = a = b \cdot \mathbb{E}[X]. \] Expectations are also additive in random variables, \[ \mathbb{E}[X + Y] = \mathbb{E}[X] = \mathbb{E}[Y]. \] If \(X\) and \(Y\) are independent, then we also have \[ \mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]. \]

Conditional expectation

With Bayesian statistics, we are typically interested in conditional expectations, which are the value of a random variable conditioned on the observed value of a second random variable. Suppose \(Y\) is a second random variable over the same implicit probability space and we observe that \(Y = y\). The conditional expectation of \(Z\) given \(Y = y\) is \[ \mathbb{E}[Z \mid Y = y] = \int_{Z} z \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] That is, we take a weighted average of the value of \(Z\) with weights determined by the conditional density \(p_{Z \mid Y}(z \mid y)\).

Expectations of functions of random variables

Suppose we have a random variable \(Z\) and a real-valued function \(f:\textrm{supp}(Z) \rightarrow \mathbb{R}^N\). We can then define a new random variable \(W = f(Z)\). To calculate the expectation of \(W\) in terms of the density for \(Z\), we use the so-called law of the unconscious statistician, which tells us we can simplify the change of variables back to a direct average over \(Z\), \[\begin{align} \mathbb{E}[f(Z)] &= \mathbb{E}[W] \\[4pt] &= \int_W w \cdot p_W(w) \, \textrm{d}w. \\[4pt] &= \int_W w \cdot p_Z(f^{-1}(w)) \cdot (f^{-1})'(w) \, \textrm{d}w. \\[4pt] &= \int_Z f(z) \cdot p_Z(z) \, \textrm{d}z, \end{align}\] where we have written \(f^{-1}\) for the inverse transfrom form \(W\) back to \(Z\) and \((f^{-1})'\) for its derivative. The first line is the definition, the second is a standard univariate change of variables (see above), but deriving the third line in the general case requires measure-theoretic techniques beyond this short introduction.

The conditional form follows suit, \[ \mathbb{E}[f(Z) \mid Y = y] = \int_Z f(z) \cdot p_{Z \mid Y}(z \mid y) \, \textrm{d}z. \] Either way, we just take a weighted average the value of \(f(z)\) with weights \(p_Z(z)\) or \(p_{Z|Y}(z \mid y).\)

A.8 Variance, standard deviation, covariance, and correlation

Variance

The variance of a random variable is the expectation of its squared difference from the expectation, \[ \textrm{var}[X] = \mathbb{E}\!\left[ \left( X - \mathbb{E}[X] \right)^2 \right]. \] Variance can be decomposed as follows. \[\begin{align} \textrm{var}[X] &= \mathbb{E}\!\left[ \left( X - \mathbb{E}[X] \right)^2 \right] \\[4pt] &= \mathbb{E}\!\left[ \, X^2 - 2 \cdot X \cdot \mathbb{E}[X] + \mathbb{E}[X]^2 \, \right] \\[4pt] &= \mathbb{E}\!\left[X^2 \right] - \mathbb{E}\!\left[2 \cdot X \cdot \mathbb{E}[X]\right] + \mathbb{E}\!\left[\mathbb{E}[X]^2\right] \\[4pt] &= \mathbb{E}\!\left[X^2 \right] - 2 \cdot \mathbb{E}[X] \cdot \mathbb{E}[X] + \mathbb{E}[X]^2 \\[4pt] &= \mathbb{E}\!\left[X^2\right] - \mathbb{E}[X]^2. \end{align}\]

If \(X\) represents a distance measured in meters, then the variance \(\textrm{var}[X]\) is expressed in square meters. This can make it tricky to reason about variances.

Adding a constant \(c \in \mathbb{R}\) to a random variable \(X\) does not change its variance, \[ \textrm{var}[c + X] = \textrm{var}[X]. \] Multiplying the variance requires quadratic scaling, \[ \textrm{var}[c \cdot X] = c^2 \cdot \textrm{var}[X]. \] The variance of a sum is the sum of the variances, \[ \textrm{var}[U + V] = \textrm{var}[U] + \textrm{var}[V]. \]

Standard deviation

The standard deviation of a random variable is the square root of its variance, \[ \textrm{sd}[X] = \sqrt{\textrm{var}[X]}. \] The units of the standard deviation are identical to those of the random variable \(X\). Thus we say that the standard deviation is the scale of a variable’s variation.

Standard deviations of a random variable \(X\) have the same scale as the random variable \(X\) itself. They also drop additive constants, but unlike variances, they scale linearly with multiplication due to their scale. For any constant \(c \in \mathbb{R},\) we have \[\begin{align} \textrm{sd}[c + X] &= \textrm{sd}[X] \\[4pt] \textrm{sd}[c \cdot X] &= c \cdot \textrm{sd}[X]. \end{align}\] The standard deviation of a sum works through the fact that variances are additive, \[ \textrm{sd}[U + V] = \sqrt{\textrm{sd}[U]^2 + \textrm{sd}[V]^2}. \]

Covariance

If we have two random variables \(X\) and \(Y\), their covariance is defined as \[ \textrm{cov}[X, Y] = \mathbb{E}\!\left[\strut (X - \mathbb{E}[X]) \cdot (Y - \mathbb{E}[Y])\right]. \] Covariance is measured in units that are the product of the units of \(X\) and the units of \(Y\). For example if \(X\) is watts and \(Y\) is hours, then the units of \(\textrm{cov}[X, Y]\) is watt-hours. As an operation, covariance is commutative in that \(\textrm{cov}[X,Y] = \textrm{cov}[Y, X].\)

The covariance between a variable and itself is just its variance, \[ \textrm{cov}[X, X] = \mathbb{E}\!\left[ \strut (X - \mathbb{E}[X]) \cdot (X - \mathbb{E}[X])\right] = \mathbb{E}\!\left[ \strut (X - \mathbb{E}[X])^2\right] = \textrm{var}[X]. \]

Example 6. We can construct a pair of scalar random variables \((X, Y)\) that covary by generating data according to a simple linear regression with \(\mathbb{E}[Y] = \alpha + \beta \cdot X,\) \[\begin{align} p_X(x) &= \textrm{normal}(x \mid 0, \tau), \textrm{ and} \\[4pt] p_{Y \mid X}(y \mid x) &= \textrm{normal}(y \mid \alpha + \beta \cdot x, \sigma), \end{align}\] where \(\sigma, \tau \in (0, \infty)\) are scale parameters (i.e., determine the distribution’s standard deviation). Because the normal distribution uses a location-scale parameterization, we can also construct \(Y\) using a simple change of variables using a new variable \(E\) corresponding to the “error,” \[\begin{align} Y &= \alpha + \beta \cdot X + E \\[4pt] p_E(\epsilon) &= \textrm{normal}(0, \sigma). \end{align}\] We know from regression theory that if \(Y = \alpha + \beta \cdot X + E\) where \(E\) has a normal distribution, then \[ \beta = \frac{\textrm{cov}[X, Y]}{\textrm{var}[X]}, \] and hence \[ \textrm{cov}[X, Y] = \beta \cdot \textrm{var}[X] = \beta \cdot \tau^2. \] We can validate our mathematical derivation via simulation.

M = 100_000
tau = 2.3
sigma = 1.5
alpha, beta = 1.3, -0.7
x = np.random.normal(loc = 0, scale = tau, size = M)
y = np.random.normal(loc = alpha + beta * x, scale = sigma, size = M)
cov_hat = np.cov(x, y)[0, 1]
print(f"sample cov(x, y) = {cov_hat:5.2f}")
sample cov(x, y) = -3.71

Here, we expect the sample covariance to be very close to \(\beta \cdot \tau^2 = -0.7 \cdot 2.3^2 = -3.703.\)

Correlation

The correlation between two random variables is their inverse scaled covariance, \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]} {\textrm{sd}[X] \cdot \textrm{sd}[Y]}. \] The units all cancel here—the numerator has units in the square of the original variables as does the denominator. Like covariance, correlation is commutative.

The correlation between a variable and itself is the unit, \[ \textrm{corr}[X, X] = \frac{\textrm{var}[X]} {\textrm{sd}[X] \cdot \textrm{sd}[X]} = 1. \]

Example 6, cont. Given that we have computed the covariance between \(X\) and \(Y\), we can compute their correlation by dividing by their scales, \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]}{\textrm{sd}[X] \cdot \textrm{sd}[Y]}. \] We know that \(\textrm{sd}[X] = \tau,\) but \(Y\) is a compound that inherits additional uncertainty form its location parameter being a random variable itself. Nevertheless, because \(Y = \alpha + \epsilon + \beta \cdot X\), where \(\alpha\) is constant and both \(E\) and \(X\) are normal, we just add variances of the varying terms and then take the square root to return to the standard deviation scale, \[ \textrm{sd}[Y] = \sqrt{\textrm{var}[\beta \cdot X] + \textrm{var}[E]} = \sqrt{\beta^2 \cdot \textrm{var}[X] + \textrm{var}[E]} = \sqrt{\beta^2 \cdot \tau^2 + \sigma^2} \approx 2.200. \] Therefore, the correlation will be \[ \textrm{corr}[X, Y] = \frac{\textrm{cov}[X, Y]}{\textrm{sd}[X] \cdot \textrm{sd}[Y]} = \frac{\beta \cdot \tau^2}{\sigma \cdot \sqrt{\beta^2 \cdot \tau^2 + \sigma^2}} = \frac{-0.7 \cdot 2.3^2}{2.3 \cdot 2.2} \approx -0.732. \] The simulation results agree with the mathematics.

print(f"sd(x) = {np.std(x):5.2f};  sd(y) = {np.std(y):5.2f}")
corr_hat = np.corrcoef(x, y)[0, 1]
print(f"sample corr(x, y) = {corr_hat:5.2f}")
sd(x) =  2.30;  sd(y) =  2.20
sample corr(x, y) = -0.73

Covariance and correlation matrices

If we have a random variable \(X:\Omega \rightarrow \mathbb{R}^N\), its covariance matrix is \[ \textrm{cov}[X] = \mathbb{E}\!\left[(X - \mathbb{E}[X]) \cdot (X - \mathbb{E}[X])^{\top}\right]. \] The covariance matrix of \(X\) has entries that are covariances of its elements, \[ \textrm{cov}[X]_{i,j} = \textrm{cov}[X_i, X_j]. \] Covariance matrices are symmetric and positive definite (and hence full rank). The units of the entries are squared relative to the original variable \(X\).

The correlation matrix for \(X\) standardizes the covariance matrix by dividing by the scales, \[ \textrm{corr}[X] = S^{-1} \cdot \textrm{cov}[X] \cdot S^{-1}, \] where \(S\) is the diagonal matrix of scales, \(S_{n,n} = \textrm{sd}[X_n].\) The entries are the correlations between pairs of components, \[ \textrm{corr}[X]_{i,j} = \frac{1}{\textrm{sd}[X_i]} \cdot \textrm{cov}[X_i, X_j] \cdot \frac{1}{\textrm{sd}[X_j]} = \textrm{corr}[X_i, X_j]. \]

A.9 Quantiles, medians, and uncertainty intervals

The quantile function for a continuous random variable \(Z\) is its inverse cdf, \(F^{-1}_Z:(0, 1) \rightarrow \textrm{range}(Z).\) nFor a value \(p \in (0, 1),\) we say that \(F^{-1}_Z(p)\) is the \(p\)-quantile of \(Z\). In terms of probabilities, if \(z = F^{-1}_Z(p),\) then \(\Pr[Z \leq z] = p.\)

The 50% quantile, \(F_Z^{-1}(0.5),\) is known as the median. There is a 50% chance that a random variable takes on a value less than its median and a 50% chance that it takes on a value greater.

We can define posterior intervals for variables using quantile functions. For a random variable \(X\) and \(p \in (0, 1)\) the central \(p\) posterior interval of \(X\) is \(\left( F_X^{-1}\!\left(\frac{1 - p}{2}\right), F_X^{-1}\!\left(1 - \frac{1 - p}{2}\right) \right),\) and it has its nominal coverage by definition, \[ \Pr\!\left[X \in \left( F_X^{-1}\!\left(\frac{1 - p}{2}\right), F_X^{-1}\!\left(1 - \frac{1 - p}{2}\right) \right)\right] = p. \]

B. Bayesian statistics

In this appendix, we provide a concise mathematical overview of Bayesian statistics. Bayesian statistics is largely concerned with estimating quantities of interest based on a joint probability model of the data and parameters, given some observed data. It is also concerned with quantifying our uncertainty in these estimates. Quantities of interest are typically parameter values and deviations, event probability estimates, and forecasts/backcasts.

B.1 Bayes’s rule for densities

Bayes (1763) formalized his approach in the following theorem which was named after him. The basic idea is this, if we have parameters and data, we can calculate the distribution of the parameters given the data knowing on the distribution of the data given the parameters and the prior. Using names, the first step of Bayes’s theorem for densities is \[\begin{align} \underbrace{p(\textrm{params} \mid \textrm{data})}_{\textrm{posterior}} &= \underbrace{p(\textrm{data} \mid \textrm{params})}_{\textrm{sampling}} \cdot \underbrace{p(\textrm{params})}_{\textrm{prior}} \, / \, \underbrace{p(\textrm{data})}_{\textrm{evidence}}. \\[4pt] \end{align}\] In general discussions, we typically use the variable \(y\) for data and \(\theta\) for model parameters. The second step of Bayes’s theorem replaces the evidence by marginalizing data out of the numerator.

Theorem 1 (Bayes’s Theorem) Given a joint density \(p(y, \theta)\), the posterior density \(p(\theta \mid y)\) can be defined in terms that only involve the prior \(p(\theta)\) and sampling distribution \(p(y \mid \theta)\), as \[ p(\theta \mid y) \ = \ \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. \]

Proof: \[\begin{align} p(\theta \mid y) &= \frac{p(y, \theta)} {p(y)} & \textrm{[definition of conditional probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {p(y)} & \textrm{[chain rule]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y, \theta) \, \textrm{d}\theta} & \textrm{[law of total probability]} \\[6pt] &= \frac{p(y \mid \theta) \cdot p(\theta)} {\int_{\Theta} \, p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}. & \textrm{[chain rule]} \end{align}\] \(\blacksquare\)

Bayes’s theorem allows us to solve the so-called inverse problem of inferring the posterior \(p(\theta \mid y)\) when all we have is the sampling distribution \(p(y \mid \theta),\) the prior \(p(\theta)\), and data \(y\).

In most casses, Stan programs do not require normalized log densities. This allows us to go a step further and drop the denominator \(p(y)\), which does not depend on \(\theta\) and write Bayes’s rule up to a proportion as \[ p(\theta \mid y) \ \propto \ p(y \mid \theta) \cdot p(\theta). \]

This lets us proceed with only an unnormalized sampling density \(p(y \mid \theta)\) and unnormalized prior \(p(\theta)\). About the only time that a normalizing constant is required is for comparing two models prior predictive distributions, which is not something we would recommend doing.

B.2 Bayesian inference

Bayesian inference is largely about estimating quantities of interest based on a probability model and observed data. In typical applied Bayesian inference problems, we are interested in three quantities that can be expressed as expectations: parameter estimates, event probabilities, and probabilistic prediction. All of these quantities are expressed as expectations over the posterior, meaning that they involve averaging over our uncertainty in parameter values.

We are also interested in uncertainty, which is defined by the posterior. We typically summarize uncertainty using quantiles, and in particular, posterior intervals (sometimes called “credible intervals”).

Parameter estimation

The first quantity of interest is the value of parameters. The standard Bayesian parameter estimate is the posterior mean, or conditional expectation given the data. Given a model \(p(y, \theta)\) and observed data \(y\), the Bayesian posterior mean estimate of the parameters \(\theta\) is

\[\begin{align} \widehat{\theta} &= \mathbb{E}[\Theta \mid y] \\[6pt] &= \int_{\Theta} \theta \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]

The posterior mean is the value that minimizes expected square error in the estimates if the model is well specified in the sense of representing the true data-generating process. Squared error is the squared \(\textrm{L}_2\) norm of the difference between the estimate and the true parameter values. We can expand these definitions down to basic form.

\[\begin{align} \widehat{\theta} &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \left|\left| \, u - \Theta \, \right|\right|^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. (u - \Theta)^{\top} \cdot (u - \Theta) \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \mathbb{E}\!\left[\left. \sum_{d=1}^D \, (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \mathbb{E}\!\left[ \left. (u_d - \Theta_d)^2 \, \right| \, y\right] \\[6pt] &= \textrm{arg min}_{u} \ \sum_{d=1}^D \, \int_{\Theta} (u_d - \theta_d)^2 \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]

It’s clear from the final form that the estimate \(\widehat{\theta}\) is determined by averaging over posterior uncertainty.

Event probabilities

An event in statistics is a subset of the parameter space, \(A \subseteq \Theta\), where \(\Theta\) is the set of all valid parameter values. We usually pick out events using conditions on the parameter space. For example, the condition \(\theta > 0.5\) defines the event \(A = \left\{ \theta \in \Theta : \theta > 0.5 \right\}\).

The probability of an event conditioned on data is just another posterior expectation, this time of \[\begin{align} \textrm{Pr}[A \mid y] &= \mathbb{E}[\textrm{I}[\Theta \in A] \mid y] \\[6pt] &= \int_{\Theta} \textrm{I}(\theta \in A) \cdot p(\theta \mid y) \, \textrm{d}\theta, \end{align}\] where \(\textrm{I}\) is a function defined over boolean values. The expression \(\textrm{I}(\theta \in A)\) takes on value 1 if \(\theta \in A\) and value 0 otherwise. We write \(\textrm{I}[\Theta \in A]\) using square brackets because a random variable is a function, whereas we write \(\textrm{I}(\theta \in A)\) because \(\theta \in \mathbb{R}^D\) is a value; see the Appendix section on random variables.

Posterior predictive inference

Often we are interested in predicting new data \(\tilde{y}\) given the observation of existing data \(y\). This is a form of posterior predictive inference. For example, \(y\) might be the price of a stock over some time period and \(\tilde{y}\) the price of the stock in the future. Or \(y\) might be the result of past games and \(\tilde{y}\) the winner of tomorrow’s game. Posterior predictive inference is also cast an expectation, this time of a density.

\[\begin{align} p(\tilde{y} \mid y) &= \mathbb{E}\!\left[ \, p(\tilde{y} \mid \Theta) \mid y \, \right] \\[6pt] &= \int_{\Theta} p(\tilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta. \end{align}\]

C. Markov chain Monte Carlo

Stan performs asymptotically exact, full Bayesian inference using Markov chain Monte Carlo (MCMC). MCMC starts from an initial value \(\theta^{(0)}\), which can be random, user provided, or a mixture of both, and then generates a sequence of values \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(m)}, \ldots\) forming a Markov chain. A sequence forms a Markov chain if each value is generated conditioned on only the previous value, so that \[ p(\theta^{(m + 1)} \mid \theta^{(1)}, \ldots, \theta^{(M)}) = p(\theta^{(m + 1)} \mid \theta^{(m)}). \] We call the distribution \(p(\theta^{(m + 1)} \mid \theta^{(m)})\) the transition kernel of the Markov chain. Because of this dependency, the sequence of random variables making up the Markov chain exhibit a degree of autocorrelation, meaning that \(\theta^{(m)}\) is correlated with \(\theta^{(m+1)}\).

We typically assume some mild conditions on the Markov chain transition transition kernel. First, it must be ergodic in the sense of never getting stuck in such a way it can’t visit the rest of the parameter space. Second, it must be aperiodic in the sense of not cycling regularly through disjoint regions of the parameter space. Third, it needs to preserve the target density in the sense that if \(\theta^{(m)} \sim p(\theta \mid y)\) has the posterior as its marginal distribution, then \(\theta^{(m+1)} \sim p(\theta \mid y)\) also follows the posterior distribution.

Suppose the initial value \(\theta^{(0)}\) is drawn from the posterior, \[ \theta^{(0)} \sim p(\theta \mid y). \] The notation here indicates that we apply random sampling to draw a value for \(\theta\) from the distribution \(p(\theta \mid y)\) (for known data \(y\)) and assign it to \(\theta^{(0)}\). With the initial state drawn from the posterior, each value in the Markov chain will be marginally distributed according to the posterior, which we write as \[ \theta^{(1)}, \ldots, \theta^{(M)} \sim p(\theta \mid y). \]

Typically we cannot draw an initialization from the posterior—if we could, we’d just use simpler Monte Carlo methods without the Markov chain dependencies. In the situation where \(\theta^{(0)}\) is not a draw from the posterior \(p(\theta \mid y)\), then as the chain progresses, it will converge to the right distribution in the sense that in the limit as \(m \rightarrow \infty\), the draws approach the corect distribution, so that \[ \lim_{m \rightarrow \infty} p(\theta^{(m+1)} \mid \theta^{(m)}) = p(\theta^{(m+1)} \mid y). \]

Suppose we have a posterior sample of draws \(\theta^{(1)}, \ldots, \theta^{(M)}\). With this sample, we can estimate expectations by simply plugging in values of \(\theta^{(m)}\) and averaging. If our chain satisfies the mild conditions above, we know that we get the right answer asymptotically, \[\begin{align} \mathbb{E}\!\left[f(\theta) \mid y \right] &= \int_{\Theta} f(\theta) \cdot p(\theta \mid y) \, \textrm{d}\theta \\[6pt] &= \lim_{M \rightarrow \infty} \ \frac{1}{M} \sum_{m=1}^M f\!\left( \theta^{(m)} \right). \end{align}\] With only finite time in practice, we use the initial segment of the chain to estimate expectations, \[ \mathbb{E}\!\left[f(\theta) \mid y\right] \approx \frac{1}{M} \sum_{m=1}^M f(\theta^{(m)}). \]

If our initial draw is from the posterior, this estimate is unbiased. In the usual situation, where the initial draw is not from the posterior, our expectation retains a degree of bias. Nevertheless, the limit above shows that the bias goes to zero in the limit. In practice, we typically remove an initial segment of \(N\) draws before the chain has approximately converged to the right distribution, and average the remaining draws, \[ \mathbb{E}\!\left[f(\theta) \mid y\right] \approx \frac{1}{M - N} \sum_{m=N + 1}^M f(\theta^{(m)}). \]

Given an estimate \(\widehat{\theta}\), its error is \(\widehat{\theta} - \theta\). If its expected error is zero, an estimator is said to be unbiased. If the Monte Carlo draws \(\theta^{(1)}, \ldots, \theta^{(M)}\) are independent, the central limit theorem tells us that our error follows a normal distribution asymptotically, so that as \(M \rightarrow \infty\), the error of our estimate \(\widehat{\theta}\) follows a normal distribution, \[ \widehat{\theta} - \theta \sim \textrm{normal}\!\left(0, \frac{\sigma}{\sqrt{M}}\right), \] where \(\sigma\) is the standard deviation of the variable \(\theta\). Pre-asymptotically, this result holds approximately and will be used to estimate the distribution of estimation errors.

Because draws in MCMC can be correlated (or even anti-correlated), we need the MCMC central limit theorem to generalize the central limit theorem to the case of correlated variables; Geyer (2011) and Roberts and Rosenthal (2004) provide precise definitions of the theorem.

Here, we can estimate an effective sample size (ESS), which is the number of independent draws that provide the same error distribution. If \(M^{\textrm{eff}}\) is the effective sample size of the Markov chain \(\theta^{(1)}, \ldots, \theta^{(M)}\), then the error will be distributed approximately as

\[ \widehat{\theta} - \theta \sim \textrm{normal}\!\left(0, \frac{\sigma}{\sqrt{M^{\textrm{eff}}}}\right). \]

D. Installed packages

The system and operating system are as follows.

import sys
import platform

print("\nSystem")
print(sys.version)

print("\nOperating System")
print(f"""  {platform.system() = }
  {platform.release() = }
  {platform.version = }""")

System
3.9.4 (default, Apr  5 2021, 01:47:16) 
[Clang 11.0.0 (clang-1100.0.33.17)]

Operating System
  platform.system() = 'Darwin'
  platform.release() = '22.5.0'
  platform.version = <function version at 0x107f99dc0>

The installed packages (i.e., the working set) is as follows.

import pkg_resources

print("\nInstalled packages:")
for dist in pkg_resources.working_set:
    print(dist)

Installed packages:
Cython 0.29.33
Jinja2 3.0.3
Mako 1.1.6
Markdown 3.3.6
MarkupSafe 2.0.1
Pillow 9.2.0
PyYAML 6.0
Pygments 2.11.2
QtPy 2.3.0
SQLAlchemy 2.0.10
Send2Trash 1.8.0
aiohttp 3.8.4
aiosignal 1.3.1
appdirs 1.4.4
appnope 0.1.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arviz 0.14.0
asttokens 2.0.5
async-timeout 4.0.2
attrs 20.3.0
autograd 1.5
backcall 0.2.0
black 22.3.0
bleach 4.1.0
cffi 1.15.0
cftime 1.6.2
charset-normalizer 3.1.0
click 8.1.3
clikit 0.6.2
cmdstanpy 1.1.0
comm 0.1.2
coverage 5.5
crashtest 0.3.1
cycler 0.11.0
debugpy 1.6.6
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.6
entrypoints 0.4
executing 0.8.3
fastaparser 1.1
fonttools 4.34.4
frozenlist 1.3.3
future 0.18.3
greenlet 2.0.2
httpstan 4.9.1
idna 3.4
importlib-metadata 4.8.2
iniconfig 1.1.1
ipykernel 6.22.0
ipython 8.1.1
ipython-genutils 0.2.0
ipywidgets 8.0.4
jedi 0.18.1
joblib 1.2.0
jsonschema 4.4.0
jupyter 1.0.0
jupyter-cache 0.6.1
jupyter-client 7.1.2
jupyter-console 6.6.3
jupyter-core 5.3.0
jupyterlab-pygments 0.1.2
jupyterlab-widgets 3.0.5
kiwisolver 1.4.4
marshmallow 3.19.0
matplotlib 3.5.2
matplotlib-inline 0.1.3
mistune 0.8.4
mizani 0.7.4
multidict 6.0.4
mypy 0.991
mypy-extensions 0.4.3
nbclient 0.5.12
nbconvert 6.4.2
nbformat 5.1.3
nest-asyncio 1.5.4
netCDF4 1.6.2
notebook 6.4.8
numpy 1.24.1
packaging 21.3
palettable 3.3.0
pandas 1.4.3
pandocfilters 1.5.0
parso 0.8.3
pastel 0.2.1
patchworklib 0.6.0
pathspec 0.9.0
patsy 0.5.2
pdoc3 0.10.0
pexpect 4.8.0
pickleshare 0.7.5
pip 23.1.2
platformdirs 2.5.2
plotly 5.13.1
plotnine 0.9.0
pluggy 0.13.1
prometheus-client 0.13.1
prompt-toolkit 3.0.38
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.10.0
pycparser 2.21
pylev 1.4.0
pyparsing 2.4.7
pyrsistent 0.18.1
pysimdjson 5.0.2
pystan 2.19.1.1
pytest 6.2.3
pytest-cov 2.11.1
pytest-runner 5.3.0
python-dateutil 2.8.2
pytz 2022.1
pyzmq 22.3.0
qtconsole 5.4.1
scipy 1.10.0
setuptools 66.1.1
six 1.16.0
stack-data 0.2.0
statsmodels 0.13.2
tabulate 0.9.0
tenacity 8.2.2
terminado 0.13.3
testpath 0.6.0
toml 0.10.2
tomli 2.0.1
tornado 6.1
tqdm 4.64.0
traitlets 5.9.0
typed-ast 1.4.3
typing-extensions 4.2.0
ujson 5.4.0
vistan 0.0.0.7
wcwidth 0.2.5
webargs 8.2.0
webencodings 0.5.1
wheel 0.36.2
widgetsnbextension 4.0.5
xarray 2023.1.0
xarray-einstats 0.5.1
yarl 1.8.2
zipp 3.6.0
sciware-testing-python 0.1.0
bayes-infer 0.0.1

References

Anderson, Brian D. O., and John B. Moore. 1979. Optimal Filtering. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Bayes, Thomas. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society of London, no. 53: 370–418.
Betancourt, Michael. 2017a. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1701.02434.
———. 2017b. “Identifying Bayesian Mixture Models.” https://betanalpha.github.io/assets/case_studies/identifying_mixture_models.html.
Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association 112 (518): 859–77.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC Press/Chapman & Hall.
Carpenter, Bob, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015. “The Stan Math Library: Reverse-Mode Automatic Differentiation in C++.” arXiv, no. 1509.07164.
Casella, George, and Edward I George. 1992. “Explaining the Gibbs Sampler.” The American Statistician 46 (3): 167–74.
Chib, Siddhartha, and Edward Greenberg. 1995. “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49 (4): 327–35.
Dawid, A Philip. 1982. “The Well-Calibrated Bayesian.” Journal of the American Statistical Association 77 (379): 605–10.
Devroye, Luc. 1986. Non-Uniform Random Variate Generation. New York: Springer Science+Business Media, LLC.
Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate Priors for Exponential Families.” The Annals of Statistics, 269–81.
Doucet, Arnaud, Nando De Freitas, and Neil Gordon. 2001. “An Introduction to Sequential Monte Carlo Methods.” In Sequential Monte Carlo Methods in Practice, edited by Arnaud Doucet, Nando De Freitas, and Neil Gordon, 3–14. New York: Springer.
Duane, Simon, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. 1987. “Hybrid Monte Carlo.” Physics Letters B 195 (2): 216–22.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” Journal of the Royal Statistical Society Series A: Statistics in Society 182 (2): 389–402.
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. Third Edition. CRC Press.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” arXiv, no. 2011.01808.
Geyer, Charles J. 2011. “Introduction to Markov Chain Monte Carlo.” In Handbook of Markov Chain Monte Carlo, edited by Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. CRC Press/Chapman & Hall.
Gilks, Walter R, and Pascal Wild. 1992. “Adaptive Rejection Sampling for Gibbs Sampling.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 41 (2): 337–48.
Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2): 243–68.
Hoffman, Matthew D, and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–623.
Jacob, Pierre E, John O’Leary, and Yves F Atchadé. 2020. “Unbiased Markov Chain Monte Carlo Methods with Couplings.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3).
Laplace, Pierre Simon. 1774. “Mémoire Sur La Probabilité de Causes Par Les évènements.” Mémoires de Mathématique Et de Physique Presentés à l’Académie Royale Des Sciences 6: 621–56.
Laplace, Pierre-Simon. 1814. A Philosophical Essay on Probabilities. 6th English translation by Truscott, F.W. and Emory, F.L. 1951. Dover Publications.
Lin, Hanti. 2022. “Bayesian Epistemology.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta and Uri Nodelman, Fall 2022. https://plato.stanford.edu/archives/fall2022/entries/epistemology-bayesian/; Metaphysics Research Lab, Stanford University.
Little, Roderick J. 2006. “Calibrated Bayes: A Bayes/Frequentist Roadmap.” The American Statistician 60 (3): 213–23.
Livingstone, S, M Betancourt, S Byrne, and M Girolami. 2019. “On the Geometric Ergodicity of Hamiltonian Monte Carlo.” Bernoulli 25 (4A): 3109–38.
Lunn, David, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2012. The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC press/Chapman-Hall.
Marin, Jean-Michel, Pierre Pudlo, Christian P Robert, and Robin J Ryder. 2012. “Approximate Bayesian Computational Methods.” Statistics and Computing 22 (6): 1167–80.
McElreath, Richard. 2023. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second Edition. CRC Press.
Mill, John Stuart. 1882. A System of Logic: Raciocinative and Inductive. Eighth edition. New York: Harper & Brothers, Publishers.
Neal, Radford M. 2011. MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman; Hall/CRC.
Ramalho, Luciano. 2022. Fluent Python. Second Edition. O’Reilly Media, Inc.
Roberts, Gareth O, and Jeffrey S Rosenthal. 2004. “General State Space Markov Chains and MCMC Algorithms.” Probability Surveys 1: 20–71.
Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92.
Stan Development Team. 2023a. “Stan Functions Reference.” https://mc-stan.org/docs/functions-reference/index.html.
———. 2023b. “Stan Reference Manual.” https://mc-stan.org/docs/reference-manual/index.html.
———. 2023c. “Stan User’s Guide.” https://mc-stan.org/docs/stan-users-guide/index.html.
VanderPlas, Jake. 2023. Python Data Science Handbook: Essential Tools for Working with Data. Second Edition. O’Reilly Media, Inc.
Wilkinson, Leland. 2005. The Grammar of Graphics. Second Edition. Springer.

Footnotes

  1. The statistical sampling literature often overloads “sample” to mean both a sample and a draw.↩︎